Python-C parallel KVFinder (pyKVFinder) applies a Simplified Wrapper and Interface Generator (SWIG; http://www.swig.org/) to extend grid operations written in C to Python, a high-level programming language. pyKVFinder can be easily installed with the pip package management system. In pyKVFinder, the target biomolecule is inserted into a regular 3D grid, which is stored as an ndarray, considering the van der Waals radii of the atoms. To detect cavities, pyKVFinder uses a dual-probe algorithm that scans the biomolecular structure, as described in [7, 8]. In summary, a small probe, dubbed Probe In, and a larger probe, dubbed Probe Out, translate over the grid points, defining two molecular surfaces with different accessibility. Biomolecular cavities are defined as the non-overlapped regions between the molecular surfaces. On each detected cavity, pyKVFinder may perform spatial, depth, hydropathy and constitutional characterizations. In this workflow, C routines undergo multithreaded parallelization to speed up cavity detection and characterization, with the OpenMP API creating a user-defined number of parallel threads, where grid chunks are distributed among these threads to perform independent operations.
Python package
pyKVFinder can be imported as a package in the Python environment and users can decide to run the full cavity detection and characterization workflow through the run_workflow function or run pyKVFinder functions in a stepwise fashion. At the latter, users can integrate pyKVFinder functions into third-party Python packages and benefit from interactive IPython/Jupyter notebooks. By running pyKVFinder in Python environment, users can visualize the detected cavities through the Python NGL Viewer widget [21]. Still, the full workflow was also coded as a command-line interface. Either way, users have access to a full set of customizable parameters to detect and characterize biomolecular cavities. A schematic diagram of cavity detection and characterization workflow is described in Fig. 1.
In the cavity detection and characterization workflow, read_vdw function reads a customizable van der Waals (vdW) radii file (.dat extension; default vdw.dat) into a nested Python dictionary, which is first indexed by the three-letter residue code (e. g., ALA, ARG, ASH, etc.), and then indexed by the atom name (e. g., C, N, O, CA, etc.) to access its respective radius value. The vdW radii file defines the radius values for each atom by residue and when not defined, uses a generic value based on the atom type. The built-in file (vdw.dat) applies the vdW radii of the Amber ff99 force field [22]. Afterwards, read_pdb or read_xyz function gets the vdW dictionary and reads a target PDB or XYZ file (.pdb or.xyz extension), respectively. The atomic data is stored in an ndarray with residue number, chain identifier, residue name, atom name, xyz coordinates and radius per atom.
To perform the cavity detection on the whole structure, the 3D grid is defined based on the target biomolecule coordinates and the user-defined parameters: grid spacing and Probe Out diameter. Thus, the grid coordinates are extracted from the atomic data ndarray, using get_vertices function that defines the grid origin and XYZ axis. The first vertex (p1) is the minimum xyz coordinates of the atomic data. The second (p2), third (p3) and fourth (p4) vertices represent the maximum coordinate along the X, Y and Z axes, respectively. Given these four vertices, the original coordinate system is transformed in the detect function, using translation and rotation, into a new coordinate system with P1 as its origin. Also, for internal calculations, the sum of the grid spacing and the Probe Out size is padded in each direction of the 3D grid. With user-defined parameters and atomic coordinates, the detect function creates and fills the 3D grids with Probe In and Probe Out probes, and compares these grids to define the biomolecular cavities, which are returned in an ndarray. In this ndarray (Fig. 2), each element corresponds to cavity space (> 1), empty space (1), biomolecule space (0) or bulk space (− 1).
Instead of performing the cavity detection on the whole structure, the search space can be set to a custom search box, called box adjustment mode. This mode is defined based on a TOML-formatted configuration file (.toml extension; Additional file 1: Fig. A1), which can explicitly define the vertices coordinates or pass a list of residues of the target biomolecule to be used as reference instead of the whole structure. The box can also be drawn using parKVFinder PyMOL plugin and its coordinates can be extracted from parKVFinder parameters file. Thus, the get_vertices_from_file function loads a box configuration file and atomic data from read_pdb or read_xyz function, which returns the grid coordinates (p1, p2, p3 and p4) and selects the atoms inside the custom 3D grid with their respective atomic data. Hence, these parameters are passed to detect function, together with box_adjustment flag, to perform cavity detection on a custom 3D grid.
Either way, whole structure mode or box adjustment mode, another space segmentation method, called ligand adjustment mode, can be applied to constrain the search space around a target ligand, defined by a PDB or XYZ file (.pdb or.xyz extension). For this mode, read_pdb or read_xyz function gets the vdW dictionary and reads a ligand PDB or XYZ file and returns the ligand atomic coordinates with their respective radius that must be passed to detect function to further restrict the search space within a radius of the target ligand.
Cavity characterization
With the ndarray of detected cavities, pyKVFinder package may perform four characterization procedures, i.e., spatial, constitutional, depth and hydropathy characterizations. The spatial characterization, using spatial function, defines the surface points in an ndarray, and estimates the volume and area of the detected cavities, following the methodology proposed in [8]. The constitutional characterization, using constitutional function, defines the interface residues that surround the detected cavities, checking if the atoms of the residues are within a radius, which is the sum of the Probe In size and the atom radius. Alternatively, the constitutional function can ignore backbone contacts by enabling ignore_backbone flag. This function stores the interface residues in a Python dictionary. Using the interface residues, the calculate_frequencies function calculates the occurrence of each residue and classes of residues per cavity. The classes of amino acid residues [23] are aliphatic apolar (R1), aromatic (R2), polar uncharged (R3), negatively charged (R4), positively charged (R5) and non-standard (RX) (Additional file 1: Table A1).
The depth characterization identifies the degree of burial of the binding site. This characterization applies depth function to identify the cavity-bulk boundary by applying a spatial filter, which defines a boundary point as cavity points where at least one direct neighbor is a bulk point, marking it with the negative of the cavity integer (Additional file 1: Fig. A2). Subsequently, the depth of each cavity point is heuristically estimated by the shortest Euclidean distance between the cavity point and its respective boundary points (Eq. 1). Each cavity point is distributed among the user-defined number of threads to perform this chunk of distance calculations (Additional file 1: Fig. A2). With the depth of all cavity points calculated and stored as an ndarray (Fig. 2), the maximum and average depths are calculated for each detected cavity and returned as Python dictionaries.
$$\widehat{{D_{i} }} = \mathop {min}\limits_{j} \left[ {d\left( {p_{i} ,p_{j} } \right)} \right] = \mathop {min}\limits_{j} \left[ {\sqrt {\left( {p_{{i_{x} }} - p_{{j_{x} }} } \right)^{2} + \left( {p_{{i_{y} }} - p_{{j_{v} }} } \right)^{2} + \left( {p_{{i_{z} }} - p_{{j_{z} }} } \right)^{2} } } \right],$$
(1)
where \(\widehat{{D_{i} }}\) is the depth of the cavity point i, \(d\left( {p_{i} ,p_{j} } \right)\) is the Euclidean distance between points i and j, \(p_{i}\) is the x, y, z coordinates of a cavity point i, \(p_{j}\) is the x, y, z coordinates of a cavity-bulk boundary point j.
The hydropathy characterization extracts the water attractiveness of the interface residues surrounding the binding site. This characterization uses hydropathy function at surface points detected in spatial characterization to map a target hydrophobicity scale on them. Firstly, a customizable TOML-formatted hydrophobicity scale file (.toml extension) is loaded into a Python dictionary, which is indexed by the three-letter residue code (e. g., ALA, ARG, ASH, etc.) to access the respective hydrophobicity value. With the dictionary loaded, the function identifies the nearest interface residues for each surface point and projects the residue’s hydrophobicity value onto them (Fig. 2). Alternatively, backbone contacts may be ignored by enabling the ignore_backbone flag. Finally, with the hydrophobicity mapped on the surface and returned as an ndarray, the average hydropathy is calculated for each detected cavity and stored in a Python dictionary. The hydrophobicity scale file defines the scale name and hydrophobicity value for each residue and, when not defined, assigns zero to the missing residues (Additional file 1: Fig. A3). The package contains six built-in hydrophobicity scales: Eisenberg and Weiss [24], Hessa and Heijne [25], Kyte and Doolittle [26], Moon and Fleming [27], Wimley and White [28] and Zhao and London [29]. Benefiting from the Python environment, users can test different scales without having to perform the cavity detection step every time. A pre-released version of hydropathy characterization has been successfully applied to compare cavities of alphavirus-related proteins [30].
The cavity detection and characterization objects, from detect, spatial, constitutional, calculate_frequencies, depth, hydropathy functions, can be stored into a pyKVFinderResults class that accumulates them in its attributes. As this data structure is filled, these attributes can be exported to files through export, write and plot_frequencies methods. The export method writes cavities together with their surface points to a PDB-formatted file with depth values in the B-factor column, while surface points with hydropathy values in the B-factor column are written to another PDB-formatted file. The write method saves file paths, volume, area, interface residues and their frequencies, maximum and average depth, and average hydropathy in a TOML-formatted file. The plot_frequencies method plots bar charts of frequencies per cavity in a PDF file. These three methods are also wrapped in export_all method. Although we presented the full workflow, all functions explained in this section can be applied independently in a step-by-step manner. In this scenario, the export, write and plot_frequencies methods have their counterparts in the export, write_results and plot_frequencies functions of pyKVFinder package, respectively. Additionally, the read_cavity function reads a cavity file (.pdb extension), written by pyKVFinder, parKVFinder or KVFinder, and a target PDB or XYZ file (.pdb or.xyz extension), and returns an ndarray with each element corresponding to the cavity space (> 1), biomolecule space (0), or bulk or empty space (− 1), similar to the output of the detect function. In this way, it allows to recharacterize a previously detected cavity or characterize a cavity with manually trimmed points.