Automatic generation of 3D motifs for classification of protein binding sites
© Nebel et al. 2007
Received: 16 February 2007
Accepted: 30 August 2007
Published: 30 August 2007
Skip to main content
© Nebel et al. 2007
Received: 16 February 2007
Accepted: 30 August 2007
Published: 30 August 2007
Since many of the new protein structures delivered by high-throughput processes do not have any known function, there is a need for structure-based prediction of protein function. Protein 3D structures can be clustered according to their fold or secondary structures to produce classes of some functional significance. A recent alternative has been to detect specific 3D motifs which are often associated to active sites. Unfortunately, there are very few known 3D motifs, which are usually the result of a manual process, compared to the number of sequential motifs already known. In this paper, we report a method to automatically generate 3D motifs of protein structure binding sites based on consensus atom positions and evaluate it on a set of adenine based ligands.
Our new approach was validated by generating automatically 3D patterns for the main adenine based ligands, i.e. AMP, ADP and ATP. Out of the 18 detected patterns, only one, the ADP4 pattern, is not associated with well defined structural patterns. Moreover, most of the patterns could be classified as binding site 3D motifs. Literature research revealed that the ADP4 pattern actually corresponds to structural features which show complex evolutionary links between ligases and transferases. Therefore, all of the generated patterns prove to be meaningful. Each pattern was used to query all PDB proteins which bind either purine based or guanine based ligands, in order to evaluate the classification and annotation properties of the pattern. Overall, our 3D patterns matched 31% of proteins with adenine based ligands and 95.5% of them were classified correctly.
A new metric has been introduced allowing the classification of proteins according to the similarity of atomic environment of binding sites, and a methodology has been developed to automatically produce 3D patterns from that classification. A study of proteins binding adenine based ligands showed that these 3D patterns are not only biochemically meaningful, but can be used for protein classification and annotation.
Structural genomics projects aim at high-throughput delivery of protein structures regardless of the state of their functional annotation. Moreover, roughly half of gene-products delivered by completed genomes of various organisms do not show sequence homology to existing proteins of known function. Therefore, structure-based prediction of protein molecular function is essential. Protein 3D structures can be clustered according to their fold in classes which usually have some functional significance, e.g. SCOP , FSSP  and CATH . More recently, researchers have investigated the detection of functional 3D patterns associated with active sites or/and atom interactions (see the Methods section for an explanation of the terminology used throughout this paper). These patterns may be based on secondary structures, such as the EF-hand domain and zinc fingers or sets of 3D positions of atoms involved either in H-bonds  or ligand binding [5, 6]. Moreover, tools have been developed which allow the detection of residue based 3D patterns within a protein structure [5–8] and the comparison of the 3D structure of binding sites with other proteins . Unfortunately, many patterns do not correspond to any specific function and the number of known 3D motifs is rather small compared to the number of sequential motifs captured in databases such as PROSITE  and InterPro . So far, 3D motifs have been mainly the result of some manual and experimental process, e.g. the Catalytic Site Atlas . Moreover, patterns are usually constructed on the basis of residues and are represented either by their Cα atoms, atoms interacting with ligands or all their atoms.
Although proteins are composed of amino acids which are very convenient and useful structural units for the analysis of proteins, ultimately chemical interactions happen at the atomic level. The associations of residues in physicochemical groups which are not mutually exclusive implicitly acknowledge the limitation of residue-centred approaches . We propose a new approach where 3D motifs are generated automatically and are based only on consensus atom positions without explicit reference to the residues to which they belong and the direct interactions they may have with ligands. In this work, the ligand binding sites of a protein are compared by superimposing the corresponding ligands. The similarity between ligand environments is then evaluated by calculating the number of atoms of the same type which share equivalent spatial positions. By converting that similarity measure into a normalised metric, a similarity matrix can be generated for a given set of proteins in order to permit clustering of their ligand binding sites. Subsequently, consensus 3D patterns can be produced to represent each of the clusters. Because the clusters can be shown to be associated with specific biochemical functions, protein structures can be compared to these 3D motifs in order to predict their function.
Initial clusters for each ligand
Proteins in PDB50
Core cluster sizes
The 3D patterns corresponding to the above 18 clusters were then compared in an all against all manner. Clusters with highly similar 3D patterns were merged into new 3D patterns, resulting in a reduction of the number of clusters from 18 to 13 [see cluster composition in Additional file 1]. Merging was performed between AMP and ATP, ADP and ATP, but not between AMP and ADP clusters. The similarity threshold associated with each pattern was defined as the lowest similarity score between any pair of binding sites belonging to the cluster represented by that pattern. These values are called the automatic similarity threshold.
Consensus information about each pattern
(846, 847, 848)
2.7. (2.7.1., 2.7.4.)
Ligase (60%) Transferase (40%)
Actin (85%) Heat shock (15%)
(973 974 975)
Table 2 demonstrates that all 3D patterns, except ADP4, are associated with well defined structural patterns as described by the CATH and DALI identifiers. This observation validates our similarity metric which is based on the number of common atoms in the neighbourhood of a ligand, as a meaningful metric for protein structure comparison. Since 7 out of the 13 3D patterns also combine consensus function and sequential motifs (ADP0, ADP1, ADP6, AMP0, ATP0, AxP1, AxP2), we can classify them as binding site 3D motifs. Furthermore, although neither ADP2 nor ADP5 are associated with a specific function, each is related to a sequence motif. Therefore they must be linked to sub-functions which are performed by ADP binding and consequently can be classified as 3D motifs. Finally, ADP3 can also be considered as a 3D motif since it has a clear function related to its EC number and it belongs to a unique CATH homologous superfamily. DALI classifies the transferases of ADP3 into two very different folds, which suggests the local similarity of their binding sites is lost in DALI's global structure comparison process.
3D patterns detected by other systems
CSA 3D Template
Since CSA recognised only 1 out of our 13 templates, we investigated if they could potentially be detected by other methods. The adenine binding sites of the representative proteins were submitted to the SuMo server for 3D searches for functional sites  and to the PINTS server find 3D local structural patterns , which compared them to all ligand binding sites in the PDB . SuMo only detected strong similarities between the binding sites of the core proteins associated with the ADP0 patterns: 5 proteins were ranked very high and only 1 protein was not matched by the site. For PINTS, all core proteins of ADP0 and ATP0 ranked very highly against their cluster representative. The existence of a common pattern was also detected with ADP6 where only 1 protein out of 5 was not highly ranked. Moreover, some evidence of binding site similarity could be collected for ADP1 and AxP0, since respectively 4 out of 7 and 7 out of 13 proteins (all of which are actins) were shown to share similar sites. Results from these three servers confirm that our metric can be used for generating meaningful patterns, most of which are 3D motifs. Moreover, the fact that only 6 of our 13 patterns could potentially have been detected by these systems suggests that our metric captures some new important features from binding sites.
Contrary to all the other patterns, ADP4 does not correspond to a class with a common function, structure or sequence pattern. The analysis of the properties of the five proteins which comprise its core reveals that three of them are ligases (1ehi, 1e4e & 1gsa) and two are transferases (1kjq & 1iah). Structurally however, the 1ehi, 1e4e, 1gsa and 1kjq ATP-binding domains are classified by SCOP  as belonging to the ATP-grasp fold and the glutathione synthetase ATP-binding domain-like superfamily and share other structure and sequence patterns such as:
CATH 3.30.1490.20, FSSP 1053, 1055 & 1058, IPR011761 and PS50975
The 1iah structure, however, is classified by SCOP as a MHCK/EF2 atypical kinase belonging to the protein kinase-like superfamily and fold; its structure and sequence patterns, namely CATH 184.108.40.206, FSSP 601 & 2482, IPR002111, IPR005821 and IPR004166 do not match the ones for the other ADP4 proteins.
The VAST structure similarity server  does not detect any structural neighbour of 1iah within the cluster. Neither does the multiple sequence alignment of these five proteins with either ClustalW , MUSCLE  or TCoffee  show any sequence pattern shared by this cluster. However, their structural alignment constructed using our system highlights the following pattern: [EQ]-X-[ACVY] [MLV], which is completed by two non-polar residues and a conserved Lysine, [FI] [VLI] [K] 36–72 residues upstream. Multiple comparisons with either the MSDfold  or CE_MC  structure similarity search engines do not detect this pattern.
Taking into account that 1iah structure has a protein kinase-like fold, it is intriguing why our method clustered the 1iah ATP binding site together with structures from the ATP-grasp fold rather than with other protein kinases clustered in AxP1 (Table 2). ATP molecules bind in the cleft between the N- and C-terminal lobs in all ADP4 and AxP1 proteins. Detailed structural analysis  revealed that 1iah has a chimeric structure where the N-terminal domain is very similar to domains of classical protein kinases whilst the C-terminal one is similar to domains of the ATP-grasp fold. In that respect the 1iha structure makes a link between protein kinase and the ATP-grasp folds, which explains why its ATP-binding site was clustered together with those belonging to the ATP-grasp fold. Interestingly, the remote similarity between classical protein kinase folds and the ATP-grasp fold had been noted previously and explained using the concept of either convergent  or divergent  evolution. We thus believe that the detected pattern is meaningful.
Total number of PDB entries per ligand and matches against the generated 3D patterns
PDB entries containing ligand
Hits against the 13 patterns
PDB50 entries in valid clusters
During the search procedure those proteins whose similarity was higher than the automatic similarity threshold were classified as positive hits, see Table 4. For each of these proteins, its annotations, i.e. functions, EC numbers, fold classifications and sequence motifs, were compared manually with those of the pattern and/or of the proteins used to define the pattern. If there was an exact match, the protein was set as true positive; otherwise it was set as false positive. Since this validation required human expertise, only the positive hits were analysed.
Approximately 33% of ADP and ATP proteins and 15% of AMP proteins matched the 3D patterns, see Table 4. These rates correspond to the percentage of binding sites in the PDB50 set, which were originally present in valid clusters when patterns were generated, see Table 4. The rate for ATP proteins is, however, higher than expected because their binding sites often match patterns generated from ADP clusters, see Figure 5. Since physicochemical properties of ANP and ACP are very similar to ADP, proteins binding these ligands were very well classified using motifs generated independently from their binding sites. The high percentage of hits against the patterns for ACP binding sites is not significant because the ACP sample is particularly small (29 entries). As a whole, our 3D patterns matched 31% of proteins with adenine based ligands. Since guanine and adenine molecules are chemically very different, proteins with guanine based ligands were not expected to be matched by our adenine patterns. In fact, only two hits were produced from more than 300 of these proteins. Finally, the rate of false positive is very low (4.5%).
These results were obtained using automatically generated similarity thresholds, which proved quite conservative. By setting these thresholds manually to optimise the number of proteins matched by each pattern, the number of adenine proteins matched was increased to 38% while maintaining a low rate of false positives (6.5%).
Although some patterns such as AMP0, AxP2, ADP6 and ATP0 mainly detect the very small number of proteins which were used to produce them, other patterns such as ADP2, ADP3, AxP0 and AxP1 are able to hit a large number of proteins with a low number of false positives. These 3D patterns, therefore, show good potential for protein annotation.
Although DALI would be a better function predictor on this dataset – 100% success rate -, it relies on a very dense sampling of the known protein structure space. Indeed, the annotation of the 281 protein structures containing adenine based ligands required 114 different structure representatives, which correspond to 40.5% of the number of structures to annotate. Since our 13 3D motifs contain only a combined total of 368 atom positions, there is a 100-fold difference between the amount of data stored to produce those predictions. This supports our claim that our motifs capture atom positions which are key to binding site activity.
Since our method can be applied on any ligand containing some rigid structure, many other ligand families could be studied. 3D motifs could be generated, for example, from haem and chlorophyll groups, monosaccharides (e.g. glucose, mannose, frucose, galactose and NAG) and other common rigid ligands such as PLP, FMN, PCA and MES since they are present in many PDB entries.
We have presented a new metric which permits the classification of proteins according to the atomic environment of binding sites. The only constraint is that their ligands should contain some rigid structure so that they can be superimposed. By studying proteins binding adenine based ligands, we have demonstrated that the core of the generated clusters are biochemically meaningful and can be used to generate useful 3D motifs. We have shown that these motifs are efficient for protein classification and annotation since their false positive rate is low. In addition, they are quite ligand specific. Since our method was able to rediscover a pattern in the case of ADP4 revealing complex evolutionary links between two classes of proteins, we believe our technique could also be used to detect cases of convergent evolution.
In future work we plan to develop a software tool which would permit fast and efficient parsing of 3D protein structures to detect putative binding sites corresponding to our atom based 3D motifs.
Atom – in this work we only consider non-hydrogen atoms. Their coordinates are retrieved from PDB files.
Ligand – molecule interacting with a protein in either a non-covalent or a covalent fashion – prosthetic group. In this work we only consider ligands containing rigid 3D structures such as aromatic rings.
Ligand binding site – subset of protein atoms that are situated within a distance of 5.0 Å from at least one ligand atom. Since proteins often contain several binding sites involving the same type of ligand, only one binding site per PDB entry is considered in order not to introduce any bias during the active site clustering process.
3D pattern – consensus atom positions generated from superimposition of ligand binding sites.
3D motif – 3D pattern associated with a biochemical function. In this work, biochemical function is defined by either a consensus functional annotation or a consensus sequence motif.
Our method comprises the following steps:
generation of ligand-specific training sets of binding sites,
comparison of the binding sites of the training set proteins,
clustering of these proteins according to the results of their binding site comparison
generation of patterns representing each of the clusters.
For the purpose of 3D pattern generation the PDB50%, a set of PDB protein structures trimmed so that no single pair of proteins has sequence identity higher than 50% , was used. PDB50% was chosen because it offers a good compromise between providing a sufficient number of PDB entries for each ligand of interest and preventing the dataset to be too biased by close homologs. For each of the main adenine based ligands, namely AMP, ADP and ATP a training set was constructed, representing a set of ligand binding sites extracted from the subset of PDB50% that binds a particular ligand. The sizes of the training sets generated on 6th December 2005 are presented in Table 1.
For the purpose of classification evaluation we used a set of protein structures that bind purine based ligands, namely AMP, ADP, ATP, ANP, ACP, GMP, GDP and GTP, selected from the entire PDB set. The numbers of PDB entries for each of these ligands are presented in Table 4.
Binding sites of the training set were compared to one another in one-to-one comparisons. Binding sites were superimposed by performing rigid transformations between them according to the atom positions of the rigid structure of their ligands. An algorithm developed by Horn  was implemented to determine the translation and rotation that will align atoms in one coordinate system to corresponding atoms in another coordinate system, while minimizing the RMSD between the two fitted sets of atoms. Subsequently, for each pairwise comparison those atoms that could not be paired within a given threshold with atoms of the same chemical type (carbon, oxygen, nitrogen or sulphur) were discarded.
where CAB is the number of atoms within the pairwise consensus ligand binding site and TA (resp. TB) is the number of atoms in protein A (resp. B) within its ligand binding site.
The above similarity matrix S was then used to cluster the set of proteins.
The clustering process is two-fold: first, binding sites are clustered and secondly the generated clusters are pruned to remove outliers. The first stage is achieved by a state of the art general purpose graph-partitioning based algorithm, CLUTO [27, 28], which has already been used in a variety of bioinformatics applications [29–33]. Since CLUTO is a partition algorithm, its result depends on the number of partitions (k) and the final clusters may contain outliers. However, these potential limitations do not significantly affect our system. Since the aim of the clustering process is the generation of tight protein clusters which will be subsequently used for 3D pattern generation, not all proteins need to belong to a cluster. Therefore, clusters are post processed so that only cores of the most compact clusters are kept, see next paragraph. Moreover, our method allows clusters to be merged at a later stage of the process. Therefore, the choice of k is not a critical step. In this study k was set at 10% of the number of sites in the training set so that the clusters generated were big enough to generate meaningful consensus patterns while their internal similarity remained high. Cluster quality could be monitored since for each cluster CLUTO provides its average internal and external similarities and their associated standard deviations.
The second stage of the process is a pruning process, which removes outliers from each cluster. Since binding site relationships are expressed by a similarity matrix, outlier detection is achieved by a distance-based technique , where outliers are defined as data points whose similarity to the data centroid is lower than a given threshold value. Since the clusters generated by our method are rather small – less than 100 members – outliers may impact significantly on the centroid position. Consequently, our method employs an iterative process, which implicitly calculates the new position of the centroid after an outlier is removed from a cluster.
Site i is an outlier, if and S i <S T
The general idea behind the generation of a 3D pattern representing a given cluster is to superimpose all the binding sites of a given cluster and generate a cluster consensus ligand binding site. Then the 3D pattern is tested against all binding sites of the training set to evaluate if it is representative of the associated cluster. This task is performed by ranking all binding sites according to their similarity with the pattern of interest after superimposition of their ligands. The top hits should consist of the binding sites of the cluster members – true positives. If there is any false positive, the pattern is optimised so that the rank of the first false positive is the lowest. Finally, when a pattern is generated for a given cluster, its similarity to the last true positive is stored as a conservative threshold which will permit the automatic decision if an active site can be annotated with the properties associated with the cluster.
Consensus patterns for sequences and secondary structure elements are usually generated by a hierarchical process of pairwise comparisons . However, since atom positions consist of continuous values, pairwise comparisons are not transitive. Therefore, a 3D pattern generation can only be generated after performing all pairwise superimpositions of binding sites which are present in a given cluster. After each comparison, only consensus atoms are kept within the pair of binding sites. This is performed using the technique described in the 'binding site comparison' section. At the end of the process, each binding site is only composed of atoms whose positions are similar within the whole cluster. The 3D pattern is then generated by averaging atom positions from all the binding sites.
The process of generation of a pattern P is described below in pseudo code where n is the number of binding sites, BS, in a given cluster:
For i = 1 to n-1
For j = i+1 to n
Generate consensus binding sites Ci & Cj
BSi = Ci
BSj = Cj
For i = 1 to n
P = BSi/n
A more detailed description of the 3D pattern generation technique is given in .
We define a "good quality pattern" as a pattern containing the minimum number of atoms which allows the discrimination of the binding sites of the cluster from all the binding sites of the training set. While cluster consensus patterns contain the minimum number of atoms whose positions are shared among the binding sites of the cluster, they may not be fully discriminative, i.e. there may be at least one false positive ranking higher than the last true positive. In such cases, new less compact patterns are produced to achieve better binding site discrimination. A new cluster pattern is generated using the process previously described without including the binding site which is the furthest away from the cluster centroid. Although this binding site was not involved in the pattern generation, the discrimination power of the pattern is still evaluated for all active site in the cluster. This process is iterated until a fully representative pattern is generated or only two active sites are left for pattern generation. In the latter case, the most representative pattern is retained. Although this process does not ensure that selected patterns provide the best representation of a cluster, in practice they prove extremely good for binding site discrimination (see 'Results and discussion' section).
Once patterns are generated for all clusters, they are superimposed in a pairwise fashion to calculate their similarity. The clusters associated with highly similar patterns are merged and new 3D patterns are generated for each of the new clusters.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.