GPU-based detection of protein cavities using Gaussian surfaces

Background Protein cavities play a key role in biomolecular recognition and function, particularly in protein-ligand interactions, as usual in drug discovery and design. Grid-based cavity detection methods aim at finding cavities as aggregates of grid nodes outside the molecule, under the condition that such cavities are bracketed by nodes on the molecule surface along a set of directions (not necessarily aligned with coordinate axes). Therefore, these methods are sensitive to scanning directions, a problem that we call cavity ground-and-walls ambiguity, i.e., they depend on the position and orientation of the protein in the discretized domain. Also, it is hard to distinguish grid nodes belonging to protein cavities amongst all those outside the protein, a problem that we call cavity ceiling ambiguity. Results We solve those two ambiguity problems using two implicit isosurfaces of the protein, the protein surface itself (called inner isosurface) that excludes all its interior nodes from any cavity, and the outer isosurface that excludes most of its exterior nodes from any cavity. Summing up, the cavities are formed from nodes located between these two isosurfaces. It is worth noting that these two surfaces do not need to be evaluated (i.e., sampled), triangulated, and rendered on the screen to find the cavities in between; their defining analytic functions are enough to determine which grid nodes are in the empty space between them. Conclusion This article introduces a novel geometric algorithm to detect cavities on the protein surface that takes advantage of the real analytic functions describing two Gaussian surfaces of a given protein.


Background
Macromolecules (e.g., proteins, nucleic acids, etc.) are the building blocks of living beings. In particular, proteins are relevant for the cell chemistry inasmuch they perform a variety of different functions, such as catalysts, transporters, sensors, and regulators of cellular processes. Such functions depend on the interactions that establish with other entities in the cell, namely long entities like nucleic acids (e.g., DNA) and with small entities like nucleotides, peptides, catalytic substrates, and man-made chemicals. Thus, such interactions have some flavors, namely: protein-ligand, protein-protein, protein-DNA, and so forth. It is clear that these interactions involve both shape complementarity and physicochemical *Correspondence: agomes@di.ubi.pt 1 Universidade da Beira Interior, Av. Marques D'Ávila e Bolama, 6200-001 Covilhã, Portugal 2 Instituto de Telecomunicações, Av. Marques D'Ávila e Bolama, 6200-001 Covilhã, Portugal complementarity between a protein and any other fitting entity.
Nevertheless, this article does not focus on physicochemical complementarity. Instead, the focus is on detecting cavities on the protein surface where ligands (i.e., small molecules) may bind. The detection of protein cavities is instrumental as a first step to establish the shape complementarity between a protein and a ligand. As noted by Kawabata and Go [1], identifying cavities is one of the simplest ways to predict ligand binding sites on the protein surface. In this sense, protein cavities can be seen as putative binding sites of a given protein for ligands.
The algorithms to identify binding sites on a molecular surface are divided into four categories: geometry-based, energy-based, evolution-based, and hybrid approaches [2]. In this paper, we are focused on geometry-based algorithms. These geometric algorithms are divided into three sub-categories [1], namely grid-based, sphere-based, and tessellation-based algorithms. Nevertheless, recently a more fine-grained classification for these algorithms has been reported by Krone et al. [3] and Simões et al. [4], which also considers hybrid categories as, for example, grid-and-sphere and grid-and-surface methods. Furthermore, Simões et al. [4] consider three more primary categories, including the one concerning surface-based methods.
Taking into consideration that this paper describes a hybrid grid-and-surface method, let us briefly review those methods involving grids and surfaces. Grid-based methods are characterized by mapping a protein onto an axis-aligned 3D grid, using then a particular geometric criterion to detect cavities on the protein surface. Wellknown geometric criteria are those based on distance [5,6], visibility [7,8], and depth [9,10]. Most grid-based algorithms use a visibility criterion that indicates the blocked directions (and non-blocked directions) between opposed points on the protein surface. That is, the protein surface plays the role of the occluder for cavities. Unfortunately, visibility-based grid methods are not orientationinvariant. In other words, changing protein's orientation may lead to an undetected cavity because its previously blocked scanning directions turn into unblocked ones. This cavity bounds' ambiguity results from the difficulty of distinguishing grid nodes belonging to cavities from those that do not.
Surface-based methods build upon the analytic description of the molecular surface (e.g., solvent-excluded surface [11] and Gaussian surface [12,13]) and its shape descriptors [14], namely solid angles [15] and curvatures [16], so that the surface is segmented into regions, some of which correspond to surface cavities. However, segmentations produced by shape descriptors have not proven to be effective in the detection of molecular cavities because the resulting segments may not match such cavities or tentative binding sites [12]. Zachmann et al. [17] and Natarajan et al. [16] tried to solve this problem by merging small segments into larger ones and determining larger segments using global shape descriptors, respectively. However, there is no evidence that such segments correspond to molecular cavities because no benchmarking analysis based on a ground-truth database of binding sites to evaluate the precision of those algorithms was carried out.
In turn, grid-and-surface based methods use a grid (or a lattice) together with at least a surface. Parulek et al. [18] proposed a method that combines a non-uniform lattice of randomly-generated points -which can be understood a generalization of grid-based techniques-and an implicitly-defined analytic surface defined by kernel functions to approximate the solvent-excluded surface (SES) [19]. The randomly-generated points inside the surface and those points outside such isosurface that are beyond a given distance relative to isosurface are discarded straight away; the remaining points are then subject to a mutual visibility test to retain those that are deemed to be cavity samples. Similarly, Krone et al. [8] use a Gaussian surface that better adjusts to SES, in conformity with the parameters set in [20] and [19]. But, instead of using sample points of the domain outside the surface, they used the vertices of the surface mesh triangles to test mutual visibility through an ambient occlusion-based visibility criterion due to Borland [21]. In both methods, the idea was to extract and track protein cavities in the context of molecular rendering and visualization, not on evaluating the accuracy of any cavity detection method relative to a certified ground truth.
As mentioned above, this paper addresses a grid-andsurface method, here called GaussianFinder. This method combines two Gaussian surfaces of a given protein, called inner and outer surfaces, as a way of finding cavities as clusters of voxels located between those two surfaces. As shown further ahead, this solves the ambiguity problems of grid-based methods mentioned above, i.e., the problems faced in the delineation of the limits of protein cavities, without using any visibility criterion of the grid-and-surface methods above to find cavities on the molecular surface. GaussianFinder aims at finding protein cavities accurately relative to ground-truth binding sites certified by known databases, as the one known as PDBsum (www.ebi.ac.uk/pdbsum/) [22].
Before proceeding any further, let us also mention the methods 3V and KVFinder due to Voss and Gerstein [23] and Oliveira et al. [10], respectively, resemble our method in solving the cavity ceiling and ground-and-walls ambiguities. But, while we find cavity voxels between two analytical surfaces, neither 3V nor KVFinder uses analytic surfaces to find such cavity voxels. Instead, they use probe and solvent spheres in conjunction with a grid, so they are grid-and-sphere methods [4]. 3V produces two voxelized volumes, the first of which is a discrete approximation to the solvent-excluded surface (SES), while the second approximates an inflated SES. The first voxelized volume is obtained after two steps. The first step collects all voxels inside atom-centered spheres whose radii are given by the van der Waals radii plus the water sphere radius of 1.5 Å, resulting in a voxelized volume that approximates the solvent-accessible surface (SAS). The second step discards voxels inside each solvent sphere centered at each frontier voxel of the SAS voxelized volume, resulting in a voxelized volume that approximates SES. This two-step procedure is repeated for the second voxelized volume, with the difference that one replaces the water sphere radius by a default probe sphere radius of 6.0 Å, so that the resulting voxelized volume approximates an inflated SES. Therefore, the cavity voxels are those that result from the difference between the second voxelized volume and the first voxelized volume that approximates SES.
In regards to KVFinder, one obtains the cavity voxels by the difference of two but different voxelized volumes. KVFinder uses a solvent sphere of radius 1.4 Å and a default probe sphere radius of 4.0 Å. However, this method only operates on grid points outside of the molecule atoms. In the first step, KVfinder collects all outside grid points such that the solvent sphere centered at each outside grid point fits in the empty outside space without overlapping the molecule. The second step is identical to the first one, with the difference that one uses the default probe sphere instead of the solvent sphere. The cavity voxels are those that belong to the first voxelized volume, but not to the second one. Therefore, cavity voxels correspond to the empty outside space where the solvent sphere gets in, but the default probe sphere does not.

The Gaussian surface
GaussianFinder builds upon the concept of Gaussian surface, which is defined as the level set is given by the following expression: where x i and r i stand for the center location and van der Waals radius of the i-th atom, respectively, while β represents the Gaussian kernel decay value. Therefore, the Gaussian surface depends on two parameters, c and β [24,25].

The leading idea
GaussianFinder identifies cavity grid nodes between two Gaussian surfaces, F in (x) = c and F out (x) = c of each protein (see Fig. 1), which are defined by the following two functions: and where R i = r i + w i , with w i = 1.4 Å standing for the radius of the water molecule. The idea is to find cavities between the inner and outer surfaces where one or more water molecules fit. Assuming the axis-aligned bounding box D enclosing the protein has been previously decomposed into equally-sized cubic voxels of length = 1.0 Å, the minimum size of a cavity is a boxed region of 3 × 3 × 3 voxels, i.e., a minimum volume of 3.0 Å 3 . Furthermore, the parameterization (c, β) was set to (1.0, 2.3) for both inner and outer surfaces because it is the one that more closely approximates the solvent-excuded surface (SES) [20,24,[26][27][28].

The GaussianFinder method: overview
The diagram of the GaussianFinder method is shown in Fig. 2. Before running the GaussianFinder on GPU, one performs three preprocessing steps as follows: • Read atomic centers of a protein from the PDB file (http://www.rcsb.org) in an array on CPU side. • Determine the bounding box D ∈ R 3 that encloses the input protein on CPU side. This involves the computation of both minimum and maximum of the coordinates x, y, and z of the centers of all protein atoms, that is, the triples p = (x min , y min , z min ) and q = (x max , y max , z max ). These coordinates are then updated such that p = p − 2R and q = q + 2R, where R is the maximum atomic radius among the atoms belonging to the molecule, as needed to guarantee that the molecule lies in the box D. • Copy the array of atomic centers into GPU memory, and allocate GPU memory for the following 3D After completing the pre-processing stage, Gaussian-Finder identifies the cavities of an input protein through the following seven steps on GPU: Note that the PDB file reading operation runs on CPU side. Then, the array of atomic centers (i.e., triples of coordinates x, y, and z) allocated in memory is transferred to GPU memory using the CUDA (Compute Unified Device Architecture [29]) function cudamemcpy. After that, the CUDA kernels encoding the GaussianFinder steps, a kernel per step, are ready to run on GPU one after another, as described below. However, the last step runs on CPU side using the DBSCAN algorithm [30], as needed to cluster cavity voxels into separate cavities.

Voxelization of the bounding box -Kernel 1
This is the first CUDA kernel. The voxelization of the bounding box D consists in partitioning D into a grid of equally-sized voxels (i.e., cubes) of length = 1.0 Å. Considering that the voxels are all axis-aligned, it thus suffices using only the 0-th corner (also called node) of each voxel to represent it, because the remaining seven corners of a voxel are 0-th corners of its adjacent voxels. Therefore, it suffices to allocate a 3-dimensional array of such 0-th corners representing the voxels on GPU side; this array is named V. The location of each 0-th corner is also calculated on the GPU side.

Computation of F in -Kernel 2
This kernel launches N threads (i.e., the size of array V), one per 0-th corner. Each thread calculates the value of F in (see Eq. (3)) at each corner in V. These function values are stored in a 3D array on GPU, called FIN, with the same size as V. But, before running this CUDA kernel on GPU, it is first necessary to allocate memory for FIN on GPU, as described in the third pre-processing step.

Computation of F out -Kernel 3
This kernel is identical to the previous one, with the difference that now we use another 3D array on GPU to hold the values of F out (cf. Eq. (4)).

Computation of voxel flags for F in -Kernel 4
To determine the intermediate voxels between the inner and outer surfaces in Step 6, we need to find the voxels outside of the inner surface. For that purpose, we determine the 8-bit flag for each voxel of the scalar field F in . Each bit is associated with each voxel corner so that we have 2 8 = 256 possible configurations for each voxel. If F in < c at a voxel corner, its bit takes the value 1; otherwise, it takes on the value 0. Therefore, the flag 11111111 2 = 255 10 indicates that the corresponding voxel is outside the inner surface because the value of F in decreases with the distance to the protein. The flags are stored in a 3D array, called FLAGIN, which is of the same size as FIN.

Computation of voxel flags for F out -Kernel 5
This kernel is the same as the previous kernel, with the difference that now the computation of voxel flags is for F out instead of F in . But, now we are interested in voxels whose flag is 00000000 2 = 0 10 , that is, voxels inside of the outer surface. The flags are stored in a 3D array, called FLAGOUT, which is of the same size as FOUT.

Identification of the intermediate voxels -Kernel 6
Based on the results of 4-th and 5-th kernels, an intermediate voxel (i, j, k) between the inner and outer surfaces is easily identified through the condition FLAGIN(i, j, k) = 255 and FLAGOUT(i, j, k) = 0, here called the intermediate condition.

Identification of cavity voxels -Kernel 7
This kernel retrieves the set of cavity voxels from the set of intermediate voxels. Note that not all intermediate voxels are cavity voxels. The condition for an intermediate voxel being a cavity voxel is that it is surrounded by a 3 × 3 × 3 neighborhood of intermediate voxels. This is so because we have to guarantee a water molecule of radius 1.4 Å fits inside a cavity. Finally, the set of cavity voxels encoded into a 3D array called CAVITYVOXELMARK is copied back to CPU via the function cudaMemcpy3D to be processed by the DBSCAN clustering algorithm.

Formation of protein cavities
The last step of the GaussianFinder runs on CPU. We use the DBSCAN clustering algorithm to separate cavity voxels into clusters featuring protein cavities. The code of DBSCAN is publicly available at https://github. com/gyaikhom/dbscan. The reader is referred to Ester et al. [30] for further details about DBSCAN.

Molecular triangulation
The graphics visualization of each protein requires the triangulation of the Gaussian molecular surface defined by F in (x) = c. This triangulation is carried out entirely on GPU side using the variant of the marching cubes algorithm introduced by Dias and Gomes [31][32][33][34]. Figure 3 shows the Gaussian surfaces (in gray) of tree proteins after their triangulation, as well as some of their cavities, whose locations are identified by small balls in red, as determined by the GaussianFinder. The small balls in blue indicate the certified locations of the same cavities as given by the PDBsum ground truth. We see that there is a match between the locations of cavities calculated by our algorithm and those determined by PDBsum dataset.

Results
The experimental testing results were obtained using a methodology built upon the following aspects: (i) hardware/software setup; (ii) a ground-truth dataset of protein cavities; (iii) set of benchmarking protein cavity detection methods; (iv) performance quality; (v) GPU time performance; and (vi) GPU memory space consumption.

Hardware/software setup
Testing was accomplished using a desktop computer running the Linux Fedora 25 operating system and equipped with an Intel-Core I7 6800K 3.4GHz GHz Processor, 32GB RAM, one Nvidia Tesla K40, and one Nvidia Quadro M6000. Most computations to detect cavities of proteins and other molecules took place on the Nvidia Tesla K40. Also, all the computations needed to triangulate surfaces of molecules and their cavities were performed on the same Nvidia Tesla K40. The Nvidia Quadro M6000 was only used for graphics output and visualization.
Furthermore, GaussianFinder was written in C/C++ together with CUDA 9.0 to run on GPU. As noted above, we used the DBSCAN clustering algorithm to form clusters of cavity voxels featuring protein cavities. This clustering step runs on CPU side. Triangulating and rendering surfaces of proteins and binding sites on GPU were performed using a variant of the GPU-based implementation of the marching cubes algorithm by Dias and Gomes [31][32][33][34].

Ground-truth dataset of protein cavities
We used PDBsum (www.ebi.ac.uk/pdbsum/) as the ground-truth dataset of protein cavities because it provides us with already known binding sites for a set of proteins [22]. In practice, we only used a subset of proteins in PDBsum; specifically, we used the dataset of proteins available in the LigASite database [35] which consists of 816 apo proteins and 1788 holo proteins, in a total of 2604 proteins. Recall that an apo protein is a protein without ligands, while a holo protein is a protein-ligand complex. The corresponding PDB files were retrieved from PDB Data Bank (www.rcsb.org). By inspection of the LigASite dataset in the PDBsum, we counted 8150 cavities on apo proteins, and 17850 cavities on holo proteins.

Benchmarking cavity detection methods
For benchmarking sake with GaussianFinder, we used the following protein cavity detection methods: • POCASA. It is essentially a grid-based method, called Roll, though it also uses a crust-like surface of probe spheres (see Yu et al. [36]). • SURFNET. It includes the sphere-based method proposed by Laskowski [37]. • PASS. It includes the sphere-based method proposed by Brady et al. [38]. • Fpocket. It includes a triangulation-based method based on a Voronoi tessellation and alpha spheres on the top of a convex hull algorithm (see Guilloux et al. [39]). • GHECOM. It includes the sphere-based method proposed by Kawabata [40]. • ConCavity. It includes the grid-based method proposed by Laskowski [41]. • 3V. This grid-and-sphere method was proposed by Voss and Gerstein [23].
These methods and the GaussianFinder were run on the same desktop computer to guarantee a fair comparison between them. Note that the first six methods listed above are also part of Metapocket [42].

Quality of performance
Let us now to analyze the performance quality of each benchmark cavity detection algorithms relative to the PDBsum ground-truth dataset of apo and holo proteins. For that purpose, we first counted 8150 cavities on the 816 apo proteins, and 17850 cavities on the 1788 holo proteins of the ground-truth dataset.
Then, upon execution of the DBSCAN, we extracted the number of clusters identified as cavities, here called positive cavities C P . These positive cavities include the true positive (TP) and false positive (FP) cavities (see Tables 1  and 2). We use the PDBsum ground-truth dataset, where the certified cavities are described per protein, to decide if a positive cavity outputted by DBSCAN is either a true positive or a false positive. Such a decision builds upon the overlapping condition which states that the geometric center of a protein cavity, as determined by a given benchmarking method, must be within a distance d ∈ [ 0.0, 4.0]Å from the geometric center of the homologous cavity provided by the PDBsum ground-truth. For example, Table 1 shows the GaussianFinder was able to identify 8730 apo protein cavities within a maximum distance d = 4.0Å, 7697 of which were correctly identified; that is, for GaussianFinder, C P = 8730, TP = 7697, and FP = C P − TP = 1033. Table 1 Performance of benchmarking detection methods for apo proteins in terms of: (d) distance (FN) false negatives to PDBsum ground-truth cavity centers; (TP) true positives; (FP) false positives; (TN) true negatives; (S v ) sensitivity; (S c ) specificity; (a) accuracy; (r d ) ratio of detected ground-truth cavities; and (C u ) cumulative number of undetected ground-truth cavities  Note that the maximum distance d = 4.0 between geometric centers of homologous cavities has to do with the minimum size of a cavity, which in turn is related to the size of the water molecule. Most algorithms consider that the water molecule has a radius of 1.4 Å to 1.8 Å, so its diameter is 3.6 Å maximum. For example, Paramo et al. [43] use a 50 Å 3 for the cavity's minimum size, which corresponds to a cube length of 3.684 Å. Thus, a distance of 4.0 Å between the center of cavity detected by a given method and the center of its homologous cavity in the PDBsum ensures that such cavities extensively overlap, unless they are very small cavities. In fact, as Pérot et al. [44] noted, a drug-binding cavity has an average volume of about 930 Å 3 when one uses a geometric-based method [14], and about 610 Å 3 in the case of using an energy-based approach to detect pockets [45].
Finally, it is worth noting that DBSCAN rejects some clusters as cavities, here called negative cavities C N . These negative cavities include the true negative (TN) and false negative (FN) cavities (see Tables 1 and 2). So, we repeat the matching process between negative cavities and ground-truth cavities to decide which of them are not cavities truly (TN), and, consequently, those that are cavities but that were incorrectly classified as not (FN). For example, Table 1 shows that DBSCAN rejected 2438 clusters as cavities of apo proteins, 393 of which are cavities indeed; that is, for GaussianFinder, C N = 2438, TN = 2045, and FN = C N − TN = 393.
The performance quality of the predictions can be assessed using various metrics, namely: sensitivity or true positive rate S v = TP TP+FN , specificity or true negative rate S c = TN TN+FP , accuracy a = TP+TN TP+FP+FN+TN , rate of detected ground-truth cavities r d = TP C , and undetected ground-truth cavities (C u = C − TP). Recall that the number of apo protein ground-truth cavities is C = 8150, while C = 17850 is the number of groundtruth cavities for holo proteins. From Tables 1 and 2, we observe that all methods have high values of sensitivity (S v > 0.9), but GaussianFinder ranks behind PASS, GHE-COM, and Fpocket regarding specificity because the value of TN is not much greater than the value of FP. However, these four methods possess an accuracy about 90% (S c ≈ 0.9). Among these methods, GaussianFinder ranks first because its rate of detected ground-truth cavities (r d ) stands out above the other methods (see Fig. 4). This means that GaussianFinder is more accurate than other benchmark methods relative to the number of detected ground-truth cavities. Note that the number C u of undetected ground-truth cavities is far less for GaussianFinder than for any other method.

Time performance
The experimental time performance of our cavity detection algorithm on GPU is shown in Fig. 5a, whose (dashed) trend line satisfies the following expression: That is, the GaussianFinder runs in O(n) time. Eq. (5) was obtained by curve fitting [46]. Thus, the experimental time complexity of our method is linear on GPU. For example, finding the cavities of a molecule with 3000 atoms takes about 0.24 s GPU. For the entire set of proteins, the GaussianFinder takes 636.40 s (11 minutes approximately) to determine all the data needed to pass to DBSCAN algorithm to make the cavities of all proteins. These times are end-to-end GPU run-times, i.e., times needed to run the seven steps or kernels of the GaussianFinder.

Memory space consumption
A brief glance at Fig. 5b shows that the memory consumption is linearly related to the increase of the number of atoms. But the memory consumption of this algorithm is not very high when compared with other algorithms that also use a grid-based approach. This is so because the grid spacing (or voxel length) is 1.0 Å for GaussianFinder. It is clear that a smaller grid spacing would consume much more memory space on GPU.

Discussion
In light of previous results, also depicted in Fig. 4, we summarize our findings as follows. In our experiments, GaussianFinder seemingly outperforms all other cavity detection methods. Additionally, grid-based methods (ConCavity, and POCASA) are less accurate than sphere-based methods (SURFNET, PASS, and GHECOM) in our test conditions; in turn, sphere-based methods are less accurate than triangulation-based methods (Fpocket). In regards to the grid-and-sphere methods, we observe that KVFinder ranks third together with GHECOM, just behind Fpocket, while 3V performs not so well, but even so with a cumulative cavity percentage above 60%. Note that we used default parameters to obtain those results; Furthermore, every single benchmark geometric method tends to detect most cavities in the first interval [0.0, 1.0]. Also, every single benchmark method performs better for holo proteins than for apo proteins. Note that, in our tests, we only considered geometric detection methods for cavities (i.e., tentative binding sites). Moreover, we used actual locations of binding sites of proteins (via PDBsum) as the ground-truth for the cavities detected by those benchmarking methods, including GaussianFinder.

Conclusions
We have introduced a novel grid-and-surface based algorithm, called GaussianFinder, for identifying cavities on protein surfaces without using a visibility criterion. The leading idea of the method is to determine the grid nodes between two Gaussian isosurfaces of each molecule, which are then aggregated into clusters of nodes featuring cavities. This avoids possible geometric ambiguities (concerning the limits of cavities) inherent to the use of grid-based methods to detect cavities of the protein surface. GaussianFinder is considerably fast, with the cavity detection stage finishing in a matter of a few seconds on a GPU-based workstation equipped with a Nvidia Tesla K40 and a Nvidia Quadro M6000. Shortly, we intend to parallelize other cavity detection algorithms existing in the literature for a more comprehensive comparison between algorithms in terms of time performance.

Availability and requirements
Project name: GaussianFinder; Project home page: sourceforge.net/projects/gaussianfinder; Operating system(s): Linux Fedora 25; Programming language: C/C++; Other requirements: CUDA 9.0; Any restrictions to use by non-academics: The source code is freely available under the GPLv3 License.