Design of the protein shape fingerprint 3DP
In view of a global analysis of the PDB we set out to identify a fingerprint encoding the 3D-structure of biomacromolecules since it is known to play an important role in their biological function, their interaction with other molecules, [22–25] and their evolution [26–29]. While 3D-SURFER uses a 121-dimensional scalar fingerprint to encode the shape of the molecular surface of proteins using 3D Zernike descriptors, [13, 14] we searched for a simpler yet more detailed encoding considering not only the shape of the molecular surface, but also the atom types and the internal structure of the protein. Inspired by the concept of atom-pair fingerprints proposed by Carhart, [30] Sheridan [31] and Schneider [32] to encode pharmacophores in small organic molecules, we recently reported a detailed analysis establishing the suitability of atom-pair fingerprints for 3D-shape and pharmacophore similarity searches in very large databases such as ChEMBL [33] and ZINC [34] using both topological distances read from the 2D-structures [35] and through-space distance read from the 3D-structures [20]. While topological distances cannot be extracted easily from the structures of macromolecules or even do not exist within non-covalent assemblies, an atom-pair analysis of macromolecules should be possible using through-space distances which can be directly computed from the atomic coordinates available in each pdb file. An atom pair 3D-fingerprint, here called 3DP, was therefore envisioned to encode 3D-structures in the PDB.
The 3DP fingerprint was designed to encode the 3D structure of proteins and other biomacromolecules in PDB considering the biological assembly as defined by the authors in each entry. Following our previous approach for small molecules, each atom pair distance was converted to a Gaussian centered on the atom pair distance with a width of 18 % of the distance itself, and the Gaussian was sampled at 34 values between 1.45 Å and 342.53 Å covering all atom pair distances present in PDB (Fig. 1a). For each of the 34 values increments were summed across all atom pairs. The 34 resulting sums were normalized to the heavy atom count to the power of 1.5 (HAC1.5) to reduce sensitivity to size (Fig. 1b). The 34 values were computed separately for four different atom categories, using the corresponding sum of category atoms as HAC for normalization. These four categories considered as important for macromolecular properties comprised 1) all atoms; 2) positive charges: lysine and arginine side chains; 3) negative charges: aspartate and glutamate side chains, phosphate groups; 4) hydrophobic atoms, defined here as carbon atoms with covalent bonds only to C or H atoms. Due to the smaller number of charged and hydrophobic atoms compared to all atoms the bit values in the corresponding categories were on average much smaller and were therefore multiplied by 5 for charged atom categories and by 2 for the hydrophobic atom category. The final bit values were expressed in percent and rounded to the integer value. The four combined sets of normalized, rounded values formed the 136-dimensional 3DP fingerprint.
Because most PDB-entries contain thousands of atoms which would require explicit calculation of many millions of atom pair distances (Fig. 1d), the 3DP calculation was simplified by computing the fingerprint using formal “sum atoms”. A maximum of 1728 sum atoms (resulting in a maximum of 1.5 million sum atom pairs) were defined for each PDB-entry by fitting its biological unit into a 12×12×12 grid, using the average coordinates of each atom category within each box as the sum atom coordinates and the number of category atoms within this box as its relative weight. 3DP bit values computed with sum atoms were essentially indistinguishable from those computed with actual atoms.
The 3DP calculation was performed 91,223 X-ray structures downloaded from the PDB in September 2014 (Additional file 1: Supplement 1), considering in each case the biological assembly as defined by the authors [36]. The 3DP fingerprint had a high resolution, with 99.99 % of PDB molecules having unique 3DP fingerprints. The average bit values peaked at 33.76 Å (B20, B122) for all atoms and hydrophobic atom pairs and at 39.83 Å (B55, B89) for positive and negative charge atom pairs, while the corresponding standard deviation covered almost the entire bit value range (Fig. 1c).
3DP encodes protein conformations
The 3DP fingerprint had a remarkable ability to precisely encode the shape of proteins, as evidenced by investigating the correlation between 3DP similarity and the root mean square deviation (RMSD) [37, 38] in different conformers of the same molecule, as presented with the following three test cases. As a first test case a random coil 24-mer peptide was obtained from PDB-entry 4GOF and simulated using the ff99SB force field from the AMBER12 package, [39] with 1 ns simulated annealing followed by 50 ns free simulation in water solvent, generating a large variety of conformers. The simulation was repeated 50 times, and in each case the last structure was selected as reference. Its conformer analogs were defined as all conformers with RMSD lower than 2 Å to that reference, which comprised up to 136 conformers that were always the last series of frames in the MD run. Retrieval of these conformers from the 2000 conformers of the corresponding trajectory was then performed by sorting using the city-block distance CBD3DP as similarity measure. The recovery was excellent, with an area under the curve (AUC) of 95.85 % for average receiver operator characteristic (ROC) curve, showing that 3DP indeed provided a good encoding of molecular shape in terms of similarity searching (Fig. 2a). All RMSD analogs were found within CBD3DP < 519, while the average distance of these RMSD analogs from the query conformer was 264 and the largest distance was 2164 (Fig. 2b).
The correlation between 3DP similarity and RMSD was tested in a second case for a larger protein by considering ten domain movement frames for glutamine binding protein (1762 atoms) from the Protein Motion Database [40, 41]. These ten conformers represent different conformations of the binding domain spanning between open (PDB-entry 1GGG, purple structure in Fig. 3a) [42] and closed state (PDB-entry 1WDN, red structure in Fig. 3a) [43]. A good correlation was observed between 3DP similarity and RMSD (calculated with all-atom alignment by Maestro 8.5 [44]), in particular when considering conformer pairs from one extreme of the movement range (purple and red lines in Fig. 3b). Representation of the bit value changes showed that 3DP perceived the conformational change at the level of each of the four different atom type categories (Fig. 3c).
The sensitivity of 3DP similarity to slight changes in protein conformation was evaluated in a third test case considering CDK2 (cyclin-dependent kinase 2), an important player in cell cycle regulation [45]. The activity of CDK2 is induced by conformational changes occurring upon ligand binding, in particular at the level of the large T-loop [46]. A total of 245 X-ray structures of CDK2 monomers bound with various small molecule inhibitors are available in the PDB (Additional file 1: Supplement 2). These CDK2 structures are nearly identical in sequence (>97 % identity), but show different conformation at the level of the T-loop due to an induced-fit adaptation of the protein around the various ligands. Although these 245 CDK2 structures were almost identical in terms of their global molecular shape, the 3DP analysis readily distinguished between the different T-loop conformers, with pairwise CBD3DP distances between 75 and 1415 CBD3DP units and maximum population around 300 ≤ CBD3DP ≤ 500 (Fig. 4a). Three different pairs were analyzed closer in terms of differences in their 3DP values (Fig. 4b) and their overall structures superimposed by all-atom alignment (Fig. 4c) [47]. In terms of the 3DP values differences were visible in all four atom pair categories. In terms of the protein structures, the closest pair 3QZH/3ROY at CBD3DP = 89 and the intermediate pair 3QRT/2C5Y at CBD3DP = 424 only significantly differed at the level of the T-loop. For the 3QQH/4EZ7 pair at CBD3DP = 1021, deviations were not only visible at the level of the T-loop but also in the α-helix and β-sheet at the top of the structures. The CDK2 analysis clearly showed that 3DP was able to perceive small conformational differences residing in a loop orientation between otherwise highly shape similar proteins.
3DP analysis of the RSCB protein databank
To facilitate access to PDB-entries a graphical user interface was created based on an interactive color-coded map representing the 3DP chemical space using the Mapplet principle previously reported for small molecule databases, by which database entries are displayed in a visualization window as the mouse cursor moves on the map, and connects to a fingerprint similarity search window to allow nearest neighbor searches [15–18]. Although 78 % of the data variability was represented in (PC1, PC2)-plane obtained by principal component analysis (PCA) of the 3DP dataset, this direct PCA map contained many scattered pixels with uneven occupancies and was not suitable as an interface (data not shown). We therefore generated an alternative representation of 3DP by similarity mapping, which produces more compact and evenly populated maps with a fairly good rendering for various chemical spaces [18, 48, 49].
To create the similarity map, 3DP similarity values relative to 200 randomly selected reference molecules from the PDB were calculated, and the (PC1, PC2)-plane obtained by PCA of the resulting 200-dimensional 3DP-similarity fingerprint was represented. This (PC1,PC2)-plane, called similarity map, covered 98 % of the similarity fingerprint data variability and had a compact, comet-like shape without peripheral pixels. Furthermore the map had a relatively even occupancy of pixels which was highly suitable as a representation of the PDB in 3DP space (Fig. 5a). PDB-entries were distributed on the map according to their size along the edge of the comet (Fig. 5b). The position on the map was influenced by the molecular volume occupancy (mvo, Å3/atom), a property defined here as the volume of the sphere with a radius corresponding to the average distance of all atoms to the center of gravity, divided by the total number of atoms, with the more compact, globular structure present at the comet edge and less compact structures at the center (Fig. 5c). The map also separated PDB-entries with different fractions of positively charged atoms, negatively charged atoms, and hydrophobic atoms, whereby the distribution pattern for the fraction of hydrophobic and positively charged atoms was somewhat similar (Fig. 5d–f). Color-coding by the normalized principal moments of inertia vector (nPMI1, nPMI2) distinguishing between rod-like, disc-like and sphere-like shapes, [21] showed that rod-like PDB-entries were located at the comet center, while the more spherical entries were distributed around the rest of the map (Fig. 5g).
The color-coded similarity maps were combined with the 3DP-similarity search tool into a web-based application called “PDB-Explorer” for interactive visualization and 3DP-similarity search through the entire PDB (Fig. 6). The website uses the same principles as our previously reported MQN-mapplet, [15–17] and is based on the open-source project visualizer (http://github.com/NPellet/visualizer) that was already successfully used for another cheminformatic project [50]. The PDB-Explorer consists of a main window to browse through various color-coded rendering of the 3DP-similarity map. The average PDB molecule in each pixel is shown in the 3D-viewer of PV, [51] and the “Show Bin” table details the contents of any selected pixel on the map from which each PDB-entry, shown as ribbon image, can be inspected closer as 3D-model by opening a secondary JSmol window. The “Locate Molecule” function allows locating any given PDB-entries on the map and the “Similarity Search” option offers nearest neighbors of any PDB-entry in the 3DP space within seconds. The application also contains uploading function to locate and search nearest neighbors of any structure represented as pdb file. All computations are performed locally by the browser of the user machine. The color-coded similarity maps and the database for similarity search are updated every day by adding new entries in the PDB. The detailed PDB-Explorer functions are described in the HELP page.
Exploring PDB using the PDB-Explorer
The PDB-Explorer allows the rapid analysis and overview of the entire PDB as well as detailed searches around selected PDB-entries using 3DP similarity as a guiding principle. Its use is exemplified with three case studies detailed below which further demonstrate the remarkable ability of 3DP to classify proteins according to their 3D structure.
The case of the conotoxins, a group of 10 to 30 residues neurotoxic peptides containing multiple intramolecular disulfide bridges and a variety of secondary structures, [52–54] provides an example of the smallest molecules listed in the PDB. Alpha-conotoxin PnIA (PDB-entry 1PEN) with 17 residues serves as reference molecule (Fig. 7a). The closest 3DP analog of 1PEN retrieved by 3DP similarity is 1AKG, which belongs to the similar α-conotoxin family and is also retrieved as nearest neighbor by BLAST search with 83 % sequence similarity (Fig. 7b) [55]. By contrast the second analog found by 3DP similarity is a short fibril peptide complex (Fig. 7c), which has a completely different secondary structure and sequence, but a similar overall size and shape. The 3rd (Fig. 7d) and 4th analog (Fig. 7e) are both from the enterotoxin family featuring a conotoxin-like shape containing double disulfide bridges, again without significant sequence similarity to the reference. Three further conotoxins missed by sequence similarity appear in the 3DP nearest neighbor search at rank 7 (1NOT, Fig. 7f), rank 9 (1HJE, Fig. 7g) and rank 33 (4TTL, Fig. 7h). In the latter case the bit value profile shows that 4TTL is the only sequence containing negatively charged residues, while all other cases do not contain any charged residues (Fig. 7i).
3DP similarity also retrieves shape analogs of molecular assemblies, as exemplified here with triose phosphate isomerase (TIM), a homodimeric enzyme which contains around 250 amino acids in the monomer [56]. The wild type TIM dimer from yeast (PDB-entry 1YPI) is selected as the reference. The top 100 neighbors of 1YPI are also protein dimers, with essentially all neighbors of 1YPI up to CBD3DP = 800 being TIM enzymes. All TIM structures in PDB (Additional file 1: Supplement 3) in fact occur within a CBD3DP distance of 1200 except 13 TIM enzymes ranked beyond rank 300 (Fig. 8a). The differences in 3DP values are illustrated for the nearest neighbor 4FF7, the rank 10 analog 8TIM at distance CBD3DP = 493 representative of the bulk of TIM structures, and 4GNJ found at rank 312 and CBD3DP = 1181 representative of the more distant group around CBD3DP = 1100 (Fig. 8b). The nearest neighbor of 1YPI, 4FF7, is also a yeast TIM with 99 % sequence similarity and nearly identical structure. PDB-entry 8TIM at rank 10 is a TIM dimer from a different organism (gallus gallus) with only 52 % of sequence similarity. In both cases the shape and fold differences to the reference are quite small, and the larger distance of 8TIM stems from differences in positively charged atoms that are not directly visible in the protein shape (Fig. 8b/c). The difference in the number and position of charged atoms also explains the further distance from the reference of 4GNJ, a TIM dimer from leishmania siamensis.
The PDB-Explorer furthermore provides interesting insight at the level of very large protein assemblies, as exemplified here for the case of a virus capsid. Starting with the subviral particle of the bursal disease virus capsid (PDB-entry 2GSY) [57] containing 27,120 residues (Fig. 9a), four analogs are readily identified within the distance range CBD3DP < 10,000. The nearest neighbor is 3FBM (Fig. 9b) which is a mutant protein of the query 2GSY, [58] and the second and third closest structures 1WCD and 2DF7 are capsids from the same bursal disease virus (Fig. 9c, d). The fourth analog (PDB-entry 3IDE) is the protein coat of Infectious Pancreatic Necrosis Virus (IPNV) [59] which has similar spikes and forms a similar icosahedral capsid organization as the capsid of infectious bursal disease virus (Fig. 9e) [59]. Interestingly the parvovirus capsid protein 1DNV (Fig. 9f), which is very close in size to the reference 2GSY, only appears much further away in CBD3DP space because its spherical shape does not feature spikes, which significantly impact its 3DP fingerprint profile (Fig. 9g).
3DP space automatically clusters proteins from the same superfamily in tight groups provided that they are in a similar size range. This property is illustrated here for the recovery of each family from a benchmark dataset of 150 CATH superfamilies, each containing between 12 and 1533 PDB entries [7, 8], considering proteins in the majority size range (Additional file 1: Supplement 4). The ROC (receiver operator characteristic) curves for recovering superfamily members from the PDB-entry closest to all same superfamily members in 3DP-space give very high AUC (area under the curve, all above 90 %) and generally high EF0.1% values (enrichment factor at 0.1 % database coverage in the range 29–1000, Fig. 10a). Most superfamilies also appear as tight groups on the PDB-map (Fig. 10b). The clustering of protein superfamilies in 3DP-space reflects the fact that the definition of these families considers similarities in folds in addition to function.
Comparing 3DP with protein structure alignment tools
The performance of 3DP was compared with three protein structure alignment tools, Fr-TMalign [60, 61], SPalign [62] and MATT [63]. Fr-TMalign is applied to pairwise structure alignment based on fragment similarity, while SPalign is designed for detecting proteins with similar fold and similar function of DNA or RNA binding. MATT (Multiple Alignment with Translations and Twists) is a program to align multiple protein structures allowing certain flexibility between fragments. These alignment methods are computationally intensive and therefore only applicable to a limited number of comparisons. They are size independent and focus on backbone alignment resulting in a focus on secondary structures. On the other hand 3DP is size dependent and considers all protein atoms indiscriminately, resulting in a sensitivity to the overall shape rather than to secondary structures. Remarkably, 3DP allows an essentially instantaneous comparison with the entire PDB when using the PDB-Explorer website.
In a first comparative study, all pairwise alignment scores were computed for the ten domain movement frames for glutamine binding protein and compared with CBD3DP. The data showed that CBD3DP, which performs comparably to overall structure RMSD as dicussed above (Fig. 3), had similar trend but higher sensitivity to conformer differences than Fr-TMalign or SPalign. MATT was highly sensitive to small conformational changes and classified all non-identical conformer pairs as low scoring (Fig. 10c).
A further comparison of 3DP with structure alignment tools was carried out for 50 homologous CDK2 proteins (Fig. 4) and 50 non-CDK2 decoy proteins. The 100 structures were in similar size range from 2100–2600 heavy atoms, and decoys consisted of non-homologous proteins with pairwise sequence identity lower than 30 %. Alignment scores and 3DP distances were computed for the 1225 CDK2 pairs and the 2500 CDK2-decoy pairs (Fig. 10d). 3DP made a relatively clear cut between CDK2 pairs and CDK2-decoy cross-pairs at CBD3DP = 750, with all CDK2 pairs found within the range CBD3DP < 1000. However the 3DP comparison recognized some decoys such as 1A8E (serum transferrin) as CDK2-like due to an overall similar shape, although this decoy had a clearly different fold as correctly analyzed by each of the three alignment tools (Fig. 10e, left). The three alignment methods correctly assigned a high score to all CDK2 pairs, but also returned a high core with part of the CDK2-decoys which were not recognized as CDK2 like by 3DP. For example decoy 4W9X, which is a non-CDK2 kinase, clearly showed a partly homologous fold to CDK2 leading to a high alignment score, but also showed substantial differences with the presence of a central helix and an extended terminal absent from CDK2 which resulted in a relatively high 3DP-distance to CDK2s such as 3PY1 (Fig. 10e, right).