Using the graph-based structural patterns, we have developed GSP4PDB, a bioinformatics tool that allows the user to analyse protein-ligand interactions across the entire Protein Data Bank. GSP4PDB consists of three main elements: gsp4pdb-parser, a Java tool that extracts and processes data from PDB coordinate files; a relational database (PostgreSQL) that stores and manages protein data; and a web application that provides a graphical interface to visualize, search and explore graph-based structural patterns.
Protein data extraction and pre-processing
GSP4PDB was designed to work with data obtained from the Protein Data Bank (PDB) [16]. To this end, we have developed gsp4pdb-parser, a command-line Java application that processes PDB files and exports the data to a relational database system.
We use rsync to maintain a local copy of the entire Protein Data Bank, so each time gsp4pdb-parser is executed, the relational database is updated with the latest proteins released in the main PDB repository. The current version of gsp4pdb-parser only processes files encoded in the PDB format (*.pdb, *.ent or *.ent.gz).
To execute gsp4pdb-parser, the user must specify a local directory where the PDB files are stored. gsp4pdb-parser then explores the directory recursively and internally prepares a list of the available PDB files. This list is filtered against the proteins already available in the relational database, so that files processed previously are skipped. Optionally, the user can specify a list of protein IDs to be processed.
For each file in the filtered list, gsp4pdb-parser reads the file using BioJava [23] and creates an object model of the protein. The main classes of the model are Protein, SChain, Aminoacid, AminoStandard, AminoStandardList, Ligand, AtomAmino, AtomLigand and Distance. Although a protein can contain many chains, for simplicity only the chain with the largest number of amino acids is currently processed.
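The file selection step described above can be illustrated with a minimal sketch. It is written in Python rather than Java for brevity and is not the gsp4pdb-parser implementation. The function names and the rule for deriving a protein ID from a file name (dropping an optional "pdb" prefix, as in archive files such as pdb1b38.ent.gz) are assumptions made for illustration.

```python
import os

# File extensions accepted by the parser, as listed in the text.
PDB_SUFFIXES = (".pdb", ".ent", ".ent.gz")


def protein_id_from_filename(filename):
    """Return an upper-case protein ID, or None if this is not a PDB file.

    Assumption: the ID is the file name without its extension and without
    an optional leading "pdb" (e.g. "pdb1b38.ent.gz" -> "1B38").
    """
    name = filename.lower()
    for suffix in sorted(PDB_SUFFIXES, key=len, reverse=True):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    else:
        return None  # no accepted extension matched
    if name.startswith("pdb") and len(name) == 7:
        name = name[3:]
    return name.upper()


def select_files(root_dir, already_processed, requested_ids=None):
    """Walk root_dir recursively and return sorted (protein_id, path) pairs.

    Files whose ID is in already_processed are skipped. If requested_ids
    is given, only those IDs are kept.
    """
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            pid = protein_id_from_filename(filename)
            if pid is None or pid in already_processed:
                continue
            if requested_ids is not None and pid not in requested_ids:
                continue
            selected.append((pid, os.path.join(dirpath, filename)))
    return sorted(selected)
```

In this sketch the set of already processed IDs would come from a query on the relational database, and the optional list of requested IDs from the command line.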
Note that a PDB file does not contain explicit information about specific atomic distances. In order to improve the performance of the system, which relies on complex join operations in the relational database, some distances are pre-computed in the initial phase.
Therefore, during the construction of the object model, two distance measures are pre-computed: the amino-amino distance and the ligand-amino distance. The distance between two amino acids A and A′ is calculated as the minimum distance over all pairs of atoms (ai,aj) such that ai∈A and aj∈A′ (i.e. we compute the distance between each pair of atoms of A and A′ and keep the smallest one). A similar approach is applied to determine the distance between a ligand L and an amino acid A. Distances greater than 7.0 Å are not considered, as we assume that there is no interaction between the atoms. Additionally, we define the class NextAminoAmino to represent the sequential relationship between amino acids in the chain.
After the object model of the protein is constructed, gsp4pdb-parser loads the data into the relational database system using a single batch of SQL statements. Next we describe the relational model used to store and manage the protein data.
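The distance definition above can be made concrete with a short sketch. It is written in Python for illustration and is not the code of gsp4pdb-parser. The representation of a residue as a list of (x, y, z) coordinates and the function names are assumptions.

```python
import math

CUTOFF = 7.0  # in Å; larger distances are discarded, as stated in the text


def min_distance(atoms_a, atoms_b):
    """Minimum Euclidean distance over all atom pairs (a, b)."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def pairwise_distances(residues, cutoff=CUTOFF):
    """Pre-compute distances between every pair of residues.

    residues maps a residue identifier to its list of atom coordinates.
    Only pairs whose minimum distance does not exceed cutoff are kept.
    """
    ids = sorted(residues)
    result = {}
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            d = min_distance(residues[first], residues[second])
            if d <= cutoff:
                result[(first, second)] = d
    return result
```

The same min_distance function applies to the ligand-amino case by passing the ligand atoms as one of the two arguments. The all-pairs loop is quadratic in the number of residues; a spatial index could reduce this cost, but the source does not state whether one is used.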
Protein data storage
GSP4PDB uses a PostgreSQL database system (version 9.4) for storing and managing protein data obtained from the PDB repository. The current database contains information on 147,531 proteins (latest synchronization on February 1, 2019). The database is formed by the relational tables listed in Fig. 4.
The table “protein” contains general information about each protein. Information about the twenty standard amino acids, plus an “undefined” amino acid, is stored in the table “standard_amino”. Most of the data rows in the database correspond to the tables “distance_amino_amino” (distances between each pair of amino acids), “distance_ligand_amino” (distances between ligands and amino acids) and “next_amino_amino” (sequential relationship between amino acids). Recall that the information in these tables is not provided explicitly by PDB, so it is computed during the pre-processing phase. The table “protein_cath” contains information about the CATH classification [24] of 114,593 proteins.
Figure 4 also shows the primary keys (attributes that identify rows in a table) and the foreign keys (attributes that refer to a primary key in another table) in the database. Note that the attributes named “id” have been designed to describe data provenance. For instance, the atom_amino having id = “1B38_A_1_4” describes atom number 4, which belongs to amino acid number 1 of chain “A” in the protein “1B38”.
Note that the database contains duplicated data in several tables (i.e. there is data redundancy). This denormalized design was selected in order to improve query computation and, consequently, to reduce the response time of the database system. The efficiency of the system is also supported by the inclusion of 12 B-tree indexes (indicated in Fig. 4 with the symbol Δ), plus the unique indexes created automatically for primary keys. This is a stable configuration which we expect to improve in the future.
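The provenance encoded in an identifier such as “1B38_A_1_4” can be recovered by splitting it on the underscore, as the following sketch shows. The field names are chosen for illustration, and the sketch covers only the four-part atom_amino format given in the example; the identifier formats of the other tables are not stated here.

```python
from typing import NamedTuple


class AtomAminoId(NamedTuple):
    protein: str       # PDB ID, e.g. "1B38"
    chain: str         # chain label, e.g. "A"
    amino_number: int  # amino acid number within the chain
    atom_number: int   # atom number within the amino acid


def parse_atom_amino_id(identifier):
    """Split an atom_amino id of the form PROTEIN_CHAIN_AMINO_ATOM."""
    protein, chain, amino, atom = identifier.split("_")
    return AtomAminoId(protein, chain, int(amino), int(atom))
```

For the example in the text, parse_atom_amino_id("1B38_A_1_4") yields protein "1B38", chain "A", amino acid number 1 and atom number 4.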
Web user interface
GSP4PDB includes an intuitive Web interface which allows the user to create a protein-ligand structural pattern, search the pattern in the relational database, and explore the search results using tabular and graphical representations. The web interface can be divided into three main components (see Fig. 5): the Navigation Bar, the Design Area and the Output Area.
The Navigation Bar allows the user to navigate among the main elements of the interface. This bar shows the number of database entries, and includes a button to display a “How to use” popup containing a short description of how to use the tool.
The Design Area allows the user to “draw” a GSP by dragging and dropping buttons associated with the types of nodes and edges allowed in a GSP (the LIGAND button creates both ligand-nodes and any-ligand-nodes). On the right-hand side of the Design Area, there are buttons to move the pattern, delete elements, or clean the design space. There is also a “help” button which displays informative text (tooltips) over the buttons of the interface.
The Design Area shown in Fig. 5 contains a GSP which is equivalent to the one presented in Fig. 3. Each amino-acid-node is labeled with the 3-letter code of the corresponding amino acid, followed by its node identifier (e.g. CYS-1). Similarly, an any-amino-acid-node is labeled with the ANY prefix and the corresponding node identifier. Each distance-edge is represented as a dashed line and is labeled with a distance range (where [0.5,7.0] is the default assignment). Next-edges are represented as traditional arrows, and gap-edges are shown as dashed arrows labeled with a gap range of the form X(min,max). The properties of nodes and edges can be changed by double-clicking on them.
Importantly, any GSP can be saved and uploaded again later, which allows the user to modify and optimize previous searches. In both cases, the GSP is managed as a JSON file having a special structure. In a similar way, the user is able to load a sample GSP by clicking the “Examples” button.
The Output Area shows the results of searching the GSP in the database, and provides filters to further explore and analyse the results. The results can be viewed in Tabular or Gallery mode. Each row in the Tabular view mode shows information about a protein containing the pattern, a button to “see” more information about the solution (including a graph-based representation), and a “3D” button which visualizes the binding site in a JSmol popup.
In the Gallery view mode the solutions are shown as a collection of “cards”. Each card contains the PDB ID of one matched protein, a graph-based representation of the binding site (similar to the input GSP), and a “Show details” button that flips the card to show additional information about the solution.
GSP4PDB includes a set of filters (or facets) that can be used to analyse the results. The filters are organized in six groups: “Protein” filters the results by PDB ID, Classification and Organism; “CATH” includes filters to explore the CATH structural hierarchy (i.e. Class, Architecture, Topology/fold and Homologous superfamily) [25]; “Ligand” is active when an any-ligand node is used; “ANY Nodes”, “Gaps” and “Distance” include a filter for each occurrence of an any-node, a gap-edge or a next-edge. In order to support further off-line analysis, the user is also able to download the list of protein IDs or the solutions in their JSON encoding.
From graph patterns to SQL queries
Recall that GSP4PDB stores the protein data in a relational database (in this case, PostgreSQL). Hence, the simplest way to query the database is to use the SQL query language.
In this section we present a brief description of the method used to transform a graph-based structural pattern into a SQL query expression. In general terms, the method generates a SQL query expression for each node-edge-node structure in the graph pattern. The final SQL query, expressing the complete graph pattern, is the composition of all the sub-expressions.
The method defines transformations for the following node-edge-node structures:
Ligand ⋯ Distance ⋯ Amino
Ligand ⋯ Distance ⋯ ANY-amino
ANY Ligand ⋯ Distance ⋯ Amino
ANY Ligand ⋯ Distance ⋯ ANY-amino
Amino — Distance — Amino
Amino — Distance — ANY-amino
ANY-amino — Distance — ANY-amino
Amino — Next → Amino
Amino — Next → ANY-amino
ANY-amino — Next → Amino
ANY-amino — Next → ANY-amino
Amino — Gap → Amino
Amino — Gap → ANY-amino
ANY-amino — Gap → Amino
ANY-amino — Gap → ANY-amino
For instance, the SQL query corresponding to a Ligand-distance-Amino structure (case 1) is the following:
The above SQL expression is a template for querying a distance relationship between a ligand and an amino acid. Note that the parameters of the template, written in square brackets, should be replaced with values from the graph pattern in order to obtain the final SQL expression. For the sake of space, we do not present the remaining transformations. We refer the reader to the complete documentation of GSP4PDB, which is available at https://structuralbio.utalca.cl/gsp4pdb/.
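To illustrate how a parameterized template for a Ligand-Distance-Amino structure might be instantiated, the following minimal sketch runs such a query against an in-memory SQLite database from Python. It is not the GSP4PDB template. Only the table name distance_ligand_amino and the default range [0.5,7.0] come from the description above; every column name, the demo rows and the function names are hypothetical, and named placeholders stand in for the square-bracket parameters.

```python
import sqlite3

# Hypothetical template: all column names are assumptions for illustration.
LIGAND_DISTANCE_AMINO = """
SELECT d.protein_id, d.ligand_id, d.amino_id
FROM distance_ligand_amino AS d
WHERE d.ligand_name = :ligand
  AND d.amino_symbol = :amino
  AND d.dist BETWEEN :min_dist AND :max_dist
ORDER BY d.protein_id, d.amino_id
"""


def create_demo_db():
    """Build an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE distance_ligand_amino ("
        "protein_id TEXT, ligand_id TEXT, ligand_name TEXT, "
        "amino_id TEXT, amino_symbol TEXT, dist REAL)"
    )
    rows = [  # toy data, not taken from the PDB
        ("0AAA", "0AAA_L1", "ATP", "0AAA_A_33", "LYS", 2.9),
        ("0AAA", "0AAA_L1", "ATP", "0AAA_A_80", "GLU", 6.1),
        ("0BBB", "0BBB_L1", "ATP", "0BBB_A_10", "LYS", 6.9),
        ("0BBB", "0BBB_L2", "HEM", "0BBB_A_12", "LYS", 3.0),
    ]
    conn.executemany(
        "INSERT INTO distance_ligand_amino VALUES (?, ?, ?, ?, ?, ?)", rows
    )
    return conn


def find_ligand_amino(conn, ligand, amino, min_dist=0.5, max_dist=7.0):
    """Instantiate the template with values taken from one pattern edge."""
    params = {
        "ligand": ligand,
        "amino": amino,
        "min_dist": min_dist,
        "max_dist": max_dist,
    }
    return conn.execute(LIGAND_DISTANCE_AMINO, params).fetchall()
```

Binding the values as query parameters, rather than substituting them into the SQL text, avoids quoting problems when values come from user input. A complete pattern would combine several such sub-expressions, for example by joining them on the identifiers of shared nodes; that composition step is not shown in this sketch.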