Algorithm 1 BINDSURF overview
1: Read main simulation configuration file bindsurf_conf.inp
2: Generate ES and VDW grids (es_grid, vdw_grid) of the protein using GEN_GRID
3: Generate ligand_conformations with GEN_CONF
4: Read protein and calculate surface_spots using GEN_SPOTS
5: for all ligand_conformations do
6: Calculate initial_configuration of the system on GPU (protein, surface_spots, ligand_conformation) using GEN_INI
7: Surface Screening using SURF_SCREEN (initial_configuration, ligand_conformation, protein, surface_spots, es_grid, vdw_grid)
8: end for
9: Process results
This section introduces the underlying design of BINDSURF, which is summarized in Algorithm 1. All necessary simulation parameters are specified in a configuration file called bindsurf_conf.inp, which contains the following information:
Molecular data: filenames related to the protein, the ligand and the input-output directories, as well as the number of ligand conformations that will be generated from the input ligand.
Force field data: ES, VDW and HBOND force field parameters used in the scoring function. By default we provide the OPLS force field.
Monte Carlo minimization parameters: number of simulation steps, energy cutoff, and maximum values for the random shift and rotation.
Output related data: file names for the different output files, such as those for graphical display through PyMOL, energy distributions, detailed information about the poses with the lowest scoring function values, etc.
Once bindsurf_conf.inp has been read and processed, we generate an ensemble of ligand conformations from the input ligand using GEN_CONF, an ad hoc version of the FlexScreen docking program. Given the modular structure of BINDSURF, any other program that generates ligand conformational ensembles can be used for the same purpose.
Next, we need the information pertaining to the areas of the protein surface (spots) where each individual simulation will take place. It is important to find an optimal number of spots, since too many spots would unnecessarily increase the total computation time, while too few would not completely cover the protein surface. We have found that a good strategy, implemented in GEN_SPOTS, consists in calculating the coordinates of the alpha carbon of each residue of the input protein. All parallel simulations take place in spherical regions, defined by the centers of these spots, with a cutoff radius of 10 Angstroms. This module can easily be replaced by any other that simply outputs the list of spot coordinates.
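This spot-generation strategy can be sketched in a few lines of Python. This is a minimal illustration only: the Spot type and the input layout are our assumptions, not the actual GEN_SPOTS data structures.

```python
# Minimal sketch of the GEN_SPOTS strategy: one spot per residue, centred
# on its alpha carbon, with a 10 Angstrom cutoff radius as in BINDSURF.
# The Spot type and the input layout are illustrative assumptions.
from dataclasses import dataclass

CUTOFF_RADIUS = 10.0  # Angstroms

@dataclass
class Spot:
    center: tuple   # (x, y, z) coordinates of the residue's alpha carbon
    radius: float   # spherical region where one parallel simulation runs

def generate_spots(residues):
    """residues: iterable of dicts mapping atom name -> (x, y, z)."""
    spots = []
    for residue in residues:
        ca = residue.get("CA")          # alpha carbon coordinates
        if ca is not None:              # skip entries without a CA atom
            spots.append(Spot(center=ca, radius=CUTOFF_RADIUS))
    return spots

# Toy protein with two residues:
protein = [{"CA": (0.0, 0.0, 0.0), "N": (1.2, 0.0, 0.0)},
           {"CA": (3.8, 0.0, 0.0)}]
spots = generate_spots(protein)
```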
In the fourth step of Algorithm 1 we generate the electrostatic (ES) and van der Waals (VDW) grids of the protein using the GPU program GEN_GRID, as described in our previous work. The main idea underlying GEN_GRID is to assume that the protein is rigid, and then to precompute the electrostatic potential and the neighbour lists on a regular spatial arrangement of points, called a grid, that encompasses both the protein and the ligands. GEN_GRID reads its parameters from the configuration files, generates the grids belonging to the input protein and writes them into different output files that are subsequently processed during conformation generation and surface screening, resulting in fast computations of the scoring function.
Before the surface screening process can begin, we need to generate all necessary simulation parameters from the input ligand and protein with the GPU program GEN_INI. For a given ligand_conformation, GEN_INI performs random ligand translations and rotations in order to obtain valid starting conformations of the ligand on each protein surface spot. Once the system is set up, the program SURF_SCREEN performs the protein surface screening procedure. Finally, BINDSURF reports statistics of the obtained results, PyMOL files for convenient 3D visualization, as well as many other reports, as will be shown in the results section.
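The kind of random initial state GEN_INI produces for each simulation can be sketched as follows. The sampling scheme (Shoemake's method for uniformly random rotations, rejection sampling for the shift) and all function names are our assumptions for illustration, not the actual GEN_INI implementation.

```python
# Illustrative sketch of a per-simulation initial state: a random shift
# inside the spot sphere plus a uniformly random unit quaternion
# (Shoemake's method). Names and sampling scheme are assumptions.
import math
import random

def random_unit_quaternion(rng):
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    return (math.sqrt(1.0 - u1) * math.sin(2.0 * math.pi * u2),
            math.sqrt(1.0 - u1) * math.cos(2.0 * math.pi * u2),
            math.sqrt(u1) * math.sin(2.0 * math.pi * u3),
            math.sqrt(u1) * math.cos(2.0 * math.pi * u3))

def random_shift_in_sphere(center, radius, rng):
    # Rejection sampling: uniform point inside the spherical spot region.
    while True:
        v = [rng.uniform(-radius, radius) for _ in range(3)]
        if sum(c * c for c in v) <= radius * radius:
            return tuple(c + o for c, o in zip(v, center))

rng = random.Random(7)
q = random_unit_quaternion(rng)                       # unit quaternion
shift = random_shift_in_sphere((5.0, 0.0, 0.0), 10.0, rng)
```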
Surface screening on GPU
In this section, we introduce the main core of the BINDSURF program, namely SURF_SCREEN. Algorithm 2 shows the host side pseudocode of BINDSURF. First, the previously obtained information, regarding the protein with its precomputed grids and surface spots, the ligand conformation, and the initial states of the simulations, is transferred from the CPU to the GPU, where the whole simulation process takes place.
Algorithm 2 Host side of the SURF_SCREEN core of the BINDSURF application for a given ligand conformation
1: CopyDataCPUtoGPU(protein, es_grid, vdw_grid, surface_spots, ligand_conformation; shifts_set, quaternions_set, initial_configuration)
2: for i = 0 to number_of_spots/BLOCKSURF do
3: for n = 1 to numSteps do
4: if n is even then
5: GenerateShiftsKernel(randomStates, shifts_set)
6: else
7: GenerateRotationsKernel(randomStates, quaternions_set)
8: end if
9: Energy(es_grid, vdw_grid, protein, ligand_conformation, shifts_set, quaternions_set, newEnergy)
10: if n is even then
11: UpdateShiftsKernel(randomStates, oldEnergy, newEnergy)
12: else
13: UpdateRotationsKernel(randomStates, oldEnergy, newEnergy)
14: end if
15: end for
16: FindMinima(oldEnergy, minIndexes, minEnergy)
17: CopyDataGPUtoCPU(minIndexes, minEnergy, shifts_set, quaternions_set)
18: end for
On each spot, many simulations (in this case, 128) are carried out in parallel on the GPU for each ligand conformation. The protein-ligand interaction energy is minimized using a parallel adaptation of the Monte Carlo method based on the Metropolis algorithm. The random numbers required by the Monte Carlo method are generated using the NVIDIA CURAND library and are later employed to perform the required ligand rotations and displacements in parallel.
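The Metropolis acceptance rule at the heart of this minimization can be sketched as follows. This is a generic textbook formulation, not the BINDSURF implementation: the temperature parameter and the function names are illustrative assumptions.

```python
# Minimal sketch of the Metropolis acceptance rule driving a Monte Carlo
# minimization: downhill moves are always accepted; uphill moves are
# accepted with Boltzmann probability exp(-dE / T). The temperature value
# and function names are illustrative assumptions, not BINDSURF parameters.
import math
import random

def metropolis_accept(old_energy, new_energy, temperature, rng):
    if new_energy <= old_energy:
        return True                      # always accept an improvement
    # Occasionally accept an uphill move to escape local minima.
    return rng.random() < math.exp(-(new_energy - old_energy) / temperature)

rng = random.Random(42)
accepted = metropolis_accept(-5.0, -7.0, 1.0, rng)   # downhill move
```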
Since this is a minimization process, each iteration depends on the previous one. Thus, the loop comprised between steps 3 and 15 of Algorithm 2 cannot be parallelized. Therefore, only the internal computation is parallelized; i.e. the generation of shifts and rotations, the energy calculation and the update of the simulation state.
Moreover, we cannot simultaneously launch all the threads needed to execute all the simulations in parallel, because the number of threads required is greater than the maximum allowed. Hence, we only perform simulations for a maximum number of spots (the BLOCKSURF value) simultaneously (line 2 of Algorithm 2).
On the one hand, each simulation needs a copy of the ligand that it can modify. On the other hand, the number of simulations required in this process is huge, and thus it is not viable to keep a copy of all the information related to the ligand atoms, such as all the atom positions, in GPU memory. An alternative way of representing the ligand information, which is independent of the ligand size and thus facilitates its allocation in the scarce GPU memory, is to keep a single model of the ligand in the GPU constant memory. In this way, the state of each simulation is represented by one 3D point and a quaternion, which represent the displacement and the rotation about the origin accumulated along the simulation. This solution can be applied because we use a rigid representation of the molecules.
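This rigid-body representation can be sketched in plain Python: the ligand model stays constant, and the current position of any atom is recovered on the fly from the accumulated shift and quaternion. The Hamilton product convention and all names here are illustrative, not BINDSURF's internal API.

```python
# Sketch of the rigid-body state: the ligand model stays constant, and
# each simulation stores only an accumulated shift (3D point) and a unit
# quaternion. All names here are illustrative assumptions.
import math

def quat_mul(q, r):
    # Hamilton product of two quaternions (w, x, y, z).
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def rotate(point, q):
    # Rotate a point by unit quaternion q via q * p * conjugate(q).
    w, x, y, z = q
    p = (0.0,) + tuple(point)
    conj = (w, -x, -y, -z)
    return quat_mul(quat_mul(q, p), conj)[1:]

def current_position(model_atom, shift, quaternion):
    # Recover the current atom position from the constant ligand model
    # and the accumulated simulation state.
    rotated = rotate(model_atom, quaternion)
    return tuple(a + b for a, b in zip(rotated, shift))

# 90-degree rotation about the z axis: q = (cos 45°, 0, 0, sin 45°).
q = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
pos = current_position((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), q)
# pos is approximately (0, 1, 0)
```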
The Monte Carlo process alternates rotation and displacement steps. Thus, we developed two different kernels: one for generating displacements of the simulations (GenerateShiftsKernel) and one for generating rotations (GenerateRotationsKernel). These kernels generate a random move using a local copy of the random state of each simulation and do not modify the random state in global memory. Later, if that move is accepted, the random state in global memory is updated with the random state that generated the movement; otherwise, the random state is not updated and the move is undone in the simulation state.
The function Energy in Algorithm 2 launches highly optimized GPU kernels for non-bonded interactions, which describe the electrostatic, van der Waals and hydrogen bond interactions between the ligand and the protein. These kernels, named ESKernel and VDWKernel, are described in the following sections. Once the energy is calculated, UpdateShiftsKernel and UpdateRotationsKernel check whether the newly calculated energy values are smaller than the previous ones, and if so the move is permanently applied to the simulation state.
The minimum energy value found within each surface sphere is obtained by the FindMinima function. This function launches two different kernels: one to reduce the energy vector, which stores all the energy values calculated in all simulations, and one to compact this vector in order to reduce the amount of data transferred to the CPU. Finally, we obtain a vector that contains the minimum energy obtained by the simulations for each spot.
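What FindMinima computes can be emulated on the CPU as follows. The block layout (a contiguous group of simulations per spot) and the function name are assumptions for illustration.

```python
# CPU sketch of FindMinima: for each spot, reduce the per-simulation
# energies to their minimum, then compact (index, energy) pairs so that
# only one value per spot is transferred back to the CPU.
def find_minima(energies, sims_per_spot):
    min_indexes, min_energies = [], []
    for spot_start in range(0, len(energies), sims_per_spot):
        block = energies[spot_start:spot_start + sims_per_spot]
        local = min(range(len(block)), key=block.__getitem__)
        min_indexes.append(spot_start + local)   # which simulation won
        min_energies.append(block[local])        # its energy value
    return min_indexes, min_energies

# Two spots, 4 simulations each:
idx, e = find_minima([-1.0, -3.5, 0.2, -2.0, 5.0, -0.5, 1.0, 2.0], 4)
# idx == [1, 5], e == [-3.5, -0.5]
```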
Once the simulations have been carried out on each spot of the protein surface, BINDSURF produces as final output, for each ligand, detailed information about the protein spots where the strongest interactions are found for the different ligand conformations. This information can be passed directly to PyMOL to obtain a graphical depiction of the results. The information regarding the hotspots obtained for the different ligands, and the set of ligands with the lowest values of the scoring function, is intended to be employed later in a more detailed VS methodology that screens only this resulting set of ligands in the hotspots found by BINDSURF. Another option is to pass the ligand binding pose obtained by a detailed VS method for a given binding site as input to BINDSURF, in order to check whether the ligand could interact with other parts of the protein surface.
In the next subsections we describe how the scoring function is calculated in both the CPU and GPU versions.
Electrostatic (ES) energy calculation
Algorithm 3 Sequential pseudocode for the calculation of the electrostatic potential
1: for i = 1 to N_simulations do
2: for j = 1 to nlig do
3: energy[i * nlig + j] = q[j] * interpolate(lig[i * nlig + j], ESGrid)
4: end for
5: end for
The precomputed ES grid is generated by the program GEN_GRID as described in , and afterwards it is read from file by SURF_SCREEN and loaded into memory. The calculation of the electrostatic potential for the protein-ligand system is performed as follows: for each ligand atom i with charge q_i at point P_i, we calculate the eight closest protein grid neighbours. Next, an interpolation procedure is applied to estimate the value of the electrostatic potential due to all protein atoms at P_i. The same procedure is applied to all ligand atoms, summing up their contributions. The pseudocode is shown in Algorithm 3, where N_simulations is the number of simulations, nlig is the number of atoms of each ligand, and the function interpolate performs the calculation of the electrostatic potential for each atom.
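The interpolation step can be sketched as follows, assuming a regular grid with unit spacing and origin at (0, 0, 0); the per-atom ES energy is then the atom charge times this interpolated potential. The function names are illustrative, not the actual BINDSURF routines.

```python
# Trilinear interpolation of a precomputed potential grid at point P:
# each of the eight surrounding grid nodes contributes with a weight
# equal to the volume of the opposite sub-cell of the enclosing cell.
import math

def trilinear_interpolate(grid, point):
    """grid[ix][iy][iz] holds the precomputed potential at each node."""
    ix, iy, iz = (int(math.floor(c)) for c in point)
    fx, fy, fz = (c - math.floor(c) for c in point)
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((fx if dx else 1.0 - fx) *
                     (fy if dy else 1.0 - fy) *
                     (fz if dz else 1.0 - fz))
                value += w * grid[ix + dx][iy + dy][iz + dz]
    return value

# 2x2x2 grid whose potential equals the x coordinate of each node:
grid = [[[0.0, 0.0], [0.0, 0.0]],
        [[1.0, 1.0], [1.0, 1.0]]]
v = trilinear_interpolate(grid, (0.25, 0.5, 0.5))
# v == 0.25: a linear field is reproduced exactly
```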
Algorithm 4 GPU pseudocode for the calculation of the electrostatic potential
1: for all nBlocks do
2: dlig = rotate(clig[myAtom], myQuaternion)
3: ilig = shift(myShift, dlig)
4: index = positionToGridCoordinates (ESGridInfo, ilig)
5: energy_shared[myAtom] = charge[myAtom] * accessToTextureMemory(ESGrid, index)
6: totalEnergy = parallelReduction(energy_shared)
7: if threadId == numThreads%nlig then
8: energy[mySimulation] = totalEnergy
9: end if
10: end for
In a previous work , we studied different strategies for the GPU implementation of the previous algorithm applied to many different ligands. In that study, we reported that the use of texture memory considerably decreases the time needed to calculate the interpolation. Therefore, in BINDSURF we decided to use texture memory to obtain the electrostatic potential by linear interpolation in a 3D grid. Algorithm 4 shows the pseudocode of the ES kernel, where clig, charge and nlig are the ligand model (atom positions, charges, and number of atoms); myAtom, myQuaternion and myShift are the atom assigned to the thread and the rotation and shift belonging to the simulation assigned to the thread, mySimulation; ESGridInfo and ESGrid are the grid description and the grid data (the latter stored in GPU texture memory); and energy_shared is an auxiliary vector in shared memory.
Each thread calculates the energy of only one atom. First, each thread obtains the current atom position from the ligand model and the simulation state using the functions rotate and shift (lines 2-3). Then, it calculates the grid position (function positionToGridCoordinates), interpolates the energy value by accessing the texture memory (function accessToTextureMemory), and stores the result in shared memory (line 5). Finally, the threads of the same simulation sum up their results by a parallel reduction (line 6) with complexity O(log(n)), and one of these threads writes the final result into global memory (lines 7-8).
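The tree-shaped reduction of line 6 can be emulated sequentially as follows; this is a sketch of the standard shared-memory reduction pattern, assuming a power-of-two number of values.

```python
# Sequential emulation of a tree-shaped parallel reduction: at each step,
# every "thread" in the active first half adds its partner's value from
# the second half, halving the active range until one sum remains. On the
# GPU the inner loop runs in parallel, giving O(log n) steps.
def parallel_reduction(values):
    values = list(values)
    stride = len(values) // 2            # assumes a power-of-two length
    while stride > 0:
        for tid in range(stride):        # parallel on the GPU
            values[tid] += values[tid + stride]
        stride //= 2
    return values[0]

total = parallel_reduction([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
# total == 36.0
```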
Van der Waals (VDW) and Hydrogen Bond (HBOND) energy calculation
Algorithm 5 Sequential pseudocode for the calculation of the VDW and HBOND energies
1: for i = 1 to N_simulations do
2: for j = 1 to nlig do
3: index = positionToGridCoordinates(VDWGridInfo, lig[i * nlig + j])
4: for k = 0 to numNeighbours(VDWGrid[index]) do
5: vdwTerm += vdwEnergy(j, VDWGrid[index][k])
6: hbondTerm += hbondEnergy(j, VDWGrid[index][k])
7: end for
8: energy[i * nlig + j] = vdwTerm + hbondTerm
9: end for
10: end for
The precomputed VDW grid is generated by the program GEN_GRID as described in , and afterwards it is read from file by SURF_SCREEN and loaded into memory. Next, the van der Waals (VDW) energy of each atom is calculated using the expression explained before and following the pseudocode shown in Algorithm 5, where N_simulations is the number of simulations, nlig is the number of atoms of the ligand, and the functions vdwEnergy and hbondEnergy perform the calculation of the van der Waals and hydrogen bond potentials for each pair of atoms.
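The per-pair terms can be sketched as follows. The 12-6 Lennard-Jones form for VDW and the 12-10 form for hydrogen bonds are common choices in force fields, and the parameter values and neighbour-list layout here are illustrative placeholders, not the actual OPLS parameters used by BINDSURF.

```python
# Hedged sketch of the per-pair energy terms summed over the neighbour
# list of an atom's grid cell. Parameter values are placeholders.
def vdw_energy(r, A=1.0, B=1.0):
    return A / r**12 - B / r**6          # 12-6 Lennard-Jones form

def hbond_energy(r, C=1.0, D=1.0):
    return C / r**12 - D / r**10         # 12-10 hydrogen-bond form

def atom_energy(atom_pos, neighbours):
    """Sum pair terms over the neighbour positions of the atom's cell."""
    vdw_term = hbond_term = 0.0
    for npos in neighbours:
        r = sum((a - b) ** 2 for a, b in zip(atom_pos, npos)) ** 0.5
        vdw_term += vdw_energy(r)
        hbond_term += hbond_energy(r)
    return vdw_term + hbond_term

e = atom_energy((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0)])
# r == 1, so both terms vanish with these placeholder parameters: e == 0.0
```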
Algorithm 6 shows the pseudocode of the VDW kernel, where VDWGridInfo and VDWGrid are the grid description and the grid data, both stored in GPU global memory. The other variables have the same meaning as in Algorithm 4. Each thread calculates the energy of only one atom. As in the previous kernel, each thread applies the rotation and displacement corresponding to its simulation to the ligand model in order to obtain the current atom position (lines 2-3). Then, it calculates the grid position, computes the VDW and HBOND potentials using the neighbours stored in the VDW grid, and stores the result in shared memory (lines 4-9). The parameters needed by the VDW and HBOND energies are previously stored in GPU constant memory. Finally, the threads of the same simulation sum up their results by a parallel reduction (line 10) and one of these threads accumulates the final result in global memory (lines 11-12).
Algorithm 6 GPU pseudocode for the calculation of the VDW and HBOND energies
1: for all nBlocks do
2: rlig = rotate(clig[myAtom], myQuaternion)
3: ilig = shift(myShift, rlig)
4: index = positionToGridCoordinates(VDWGridInfo, ilig)
5: for k = 0 to numNeighbours(VDWGrid[index]) do
6: vdwTerm += vdwEnergy(ilig, VDWGrid[index][k])
7: hbondTerm += hbondEnergy(ilig, VDWGrid[index][k])
8: end for
9: energy_shared[myAtom] = vdwTerm + hbondTerm
10: totalEnergy = parallelReduction(energy_shared)
11: if threadId == numThreads%nlig then
12: energy[mySimulation] += totalEnergy
13: end if
14: end for