Multi-dimensional characterization of electrostatic surface potential computation on graphics processors
© Daga and Feng; licensee BioMed Central Ltd. 2012
Published: 12 April 2012
Skip to main content
© Daga and Feng; licensee BioMed Central Ltd. 2012
Published: 12 April 2012
Calculating the electrostatic surface potential (ESP) of a biomolecule is critical towards understanding biomolecular function. Because of its quadratic computational complexity (as a function of the number of atoms in a molecule), there have been continual efforts to reduce its complexity either by improving the algorithm or the underlying hardware on which the calculations are performed.
We present the combined effect of (i) a multi-scale approximation algorithm, known as hierarchical charge partitioning (HCP), when applied to the calculation of ESP and (ii) its mapping onto a graphics processing unit (GPU). To date, most molecular modeling algorithms perform an artificial partitioning of biomolecules into a grid/lattice on the GPU. In contrast, HCP takes advantage of the natural partitioning in biomolecules, which in turn, better facilitates its mapping onto the GPU. Specifically, we characterize the effect of known GPU optimization techniques like use of shared memory. In addition, we demonstrate how the cost of divergent branching on a GPU can be amortized across algorithms like HCP in order to deliver a massive performance boon.
We accelerated the calculation of ESP by 25-fold solely by parallelization on the GPU. Combining GPU and HCP, resulted in a speedup of at most 1,860-fold for our largest molecular structure. The baseline for these speedups is an implementation that has been hand-tuned SSE-optimized and parallelized across 16 cores on the CPU. The use of GPU does not deteriorate the accuracy of our results.
Electrostatic interactions in a molecule are of utmost importance for analyzing its structure [1–3] as well as functional activities like ligand binding , complex formation  and proton transport . The calculation of electrostatic interactions continues to be a computational bottleneck primarily because they are long-range by nature of the potential . As a consequence, efficient approximation algorithms have been developed to reduce this computational complexity (e.g., the spherical cut-off method , the particle mesh Ewald (PME) method , the fast multipole method  and the hierarchical charge partitioning (HCP) ). The approximation algorithms can be parallelized on increasingly ubiquitous multi- and many-core architectures to deliver even greater performance benefits.
Widespread adoption of general-purpose graphics processing units (GPUs) has made them popular as accelerators for parallel programs . The increased popularity has been assisted by (i) phenomenal computing power, (ii) superior performance/dollar ratio, and (iii) compelling performance/watt ratio. For example, an 8-GPU cluster, costing a few thousand dollars, can simulate 52 ns/day of the JAC Benchmark as compared to 46 ns/day on the Kraken supercomputer, housed at Oak Ridge National Lab and which costs millions of dollars . The emergence of GPUs as an attractive high-performance computing platform is also evident from the fact that three out of the top five fastest supercomputers on the Top500 list employ GPUs .
Although the use of approximation algorithms can improve performance, they often lead to an increase in the memory boundedness of the application. Achieving optimum performance with a memory-bound application is challenging due to the 'memory wall' . The effect of the memory wall is more severe on GPUs because of the extremely high latency for global memory accesses (on the order of 600 - 800 cycles). Furthermore, for maximum performance on the GPU, execution paths on each GPU computational unit need to be synchronized. However, an important class of approximation algorithms, i.e., multi-scale approximations result in highly asynchronous execution paths due to the introduction of a large number divergent branches, which depend upon the relative distances between interacting atoms.
To test these expectations, we present a hybrid approach wherein we implement the robust multi-scale HCP approximation algorithm in a molecular modeling application called GEM  and map it onto a GPU. We counteract the high memory boundedness of HCP by explicitly managing the data movement, in a way that helps us achieve significantly improved performance. In addition, we employ the standard GPU optimization techniques, such as coalesced memory accesses and the use of shared memory, quantifying the effectiveness of each optimization in our application. HCP results in supreme performance on the GPU despite the introduction of divergent branches. This is attributed to the reduction in memory transactions that compensates for divergent branching.
Recently, several molecular modeling applications have used the GPU to speed-up electrostatic computations. Rodrigues et al.  and Stone et al.  demonstrate that the estimation of electrostatic interactions can be accelerated by the use of spherical cut-off method and the GPU. In , Hardy et al. used a multi-scale summation method on the GPU. Each of the aforementioned implementations artificially maps the n atoms of a molecule onto a m-point lattice grid and then applies their respective approximation algorithm. By doing so, they reduce the time complexity of the computation from O(nn) to O(nm). In contrast, we use HCP, which performs approximations based on the natural partitioning of biomolecules. The advantage of using the natural partitioning is that even with the movement of atoms during molecular dynamics simulations, the hierarchical nature is preserved, whereas with the lattice, atoms may move in and out of the lattice during the simulation. Our implementation realizes a maximum of 1,860-fold speedup over a hand-tuned SSE optimized implementation on a modern 16-core CPU, without any loss in the accuracy of the results.
Computing the potential at P vertices results in a time complexity of O(NP) where N is the number of atoms in the molecule. To reduce the time complexity, we apply an approximation algorithm called hierarchical charge partitioning (HCP), which reduces the upper bound of computation to O(P log N).
For this study, we have used state-of-art NVIDIA GPUs based on the Compute Unified Device Architecture or CUDA framework. CUDA is a framework developed by NVIDIA, which facilitates the implementation of general-purpose applications on GPUs. Below is a brief description of the NVIDIA GPU hardware architecture and the CUDA programming interface.
On NVIDIA GPUs, threads are organized into groups of 32, referred to as a warp. When threads within a warp follow different execution paths, such as when encountering a conditional, a divergent branch takes place. Execution of these group of threads is serialized, thereby, affecting performance. On a GPU, computations are much faster compared to a typical CPU, but memory accesses and divergent branching instructions are slower. The effect of slower memory access and divergent branching can be mitigated by initiating thousands of threads on a GPU, such that when one of the threads is waiting on a memory access, other threads can perform meaningful computations.
Every GPU operates in a memory space known as global memory. Data which needs to be operated on by the GPU, needs to be first transferred to the GPU. This process of transferring data to GPU memory is performed over the PCI-e bus, making it an extremely slow process. Therefore, memory transfers should be kept to a minimum to obtain optimum performance. Also, accessing data from the GPU global memory entails the cost of 400-600 cycles and hence, on-chip memory should be used to reduce global memory traffic. On the GT200 architecture, each SM contains a high-speed, 16 KB, scratch-pad memory, known as the shared memory. Shared memory enables extensive re-use of data, thereby, reducing off-chip traffic. Whereas on the latest Fermi architecture, each SM contains 64 KB of on-chip memory, which can be either be configured as 16 KB of shared memory and 48 KB of L1 cache or vice versa. Each SM also consists of a L2 cache of size 128 KB. The hierarchy of caches on the Fermi architecture allows for more efficient global memory access patterns.
CUDA provides a C/C++ language extension with application programming interfaces (APIs). A CUDA program is executed by a kernel, which is effectively a function call to the GPU, launched from the CPU. CUDA logically arranges the threads into blocks which are in turn grouped into a grid. Each thread has its own ID which provides for one-one mapping. Each block of threads is executed on a SM and share data using the shared memory present.
The problem of computing molecular surface potential is inherently data parallel in nature, i.e., the potential at one point on the surface can be computed independently from the computation of potential at some other point. This works to our advantage as such applications map very well onto the GPU. We begin with offloading all the necessary data (coordinates of vertices and atoms and the approximated point charges) to the GPU global memory. To ensure efficient global memory accesses patterns, we flattened the data structures. By flattening of data structures we mean that all the arrays of structures were transformed into arrays of primitives so that the threads in a half warp (16 threads) access data from contiguous memory locations [20, 21]. The GPU kernel is then executed, wherein each thread is assigned the task of computing the electrostatic potential at one vertex. At this point the required amount of shared memory, i.e, number of threads in a block times the size of the coordinates of each vertex, is allocated on each streaming multiprocessor (SM) of the GPU. The kernel is launched multiple times as required, until all the vertices are exhausted, with implicit GPU synchronization in between successive kernel launches. On the GPU side, each kernel thread copies the coordinates of its assigned vertex onto the shared memory. This results in a reduction of the number of global memory loads as explained in the Results section. The limited amount of per SM shared memory does not allow us to offload the coordinates of constituent components of the biomolecule and hence, coordinates of complexes, strands, residues, and atoms have to remain in global memory. The HCP algorithm is then applied to compute the electrostatic potential, and the result is stored in the global memory. All the threads perform this computation in parallel, and after the threads finish, the computed potential at each vertex is transferred back to the CPU memory, where a reduce (sum) operation is performed to calculate the total molecular surface potential. According to the algorithm, evaluation of distance between the vertex and molecular components requires each thread to accesses coordinates from the global memory. This implies that potential calculation at each vertex necessitates multiple global memory accesses, which makes HCP memory-bound on the GPU.
HCP also introduces a significant number of divergent branches on the GPU. This phenomenon occurs because for some threads in a warp, it may be possible to apply HCP approximation while for other, it may not be possible to do so. Therefore, these two groups of threads would diverge and follow respective paths, resulting in a divergent branch. In the Results section, we show how the associated costs of divergent branching in HCP on the GPU can be amortized to deliver a performance boost.
Characteristics of input structures
H helix myoglobin, 1MBO
nuclesome core particle, 1KX5
chaperonin GroEL, 2EU1
virus capsid, 1A6C
The Host Machine consists of an E8200 Intel Quad core running at 2.33 GHz with 4 GB DDR2 SDRAM. The operating system on the host is a 64-bit version of Ubuntu 9.04 distribution running the 2.6.28-16 generic Linux kernel. Programming and access to the GPU was provided by CUDA 3.1 toolkit and SDK with the NVIDIA driver version 256.40. For the sake of accuracy of results, all the processes that required a graphical user interface were disabled to limit resource sharing of the GPU.
Overview of GPUs used
Fermi Tesla C2050
Streaming Processor Cores
Streaming Multiprocessors (SMs)
Memory Bus type
Device Memory size
Shared Memory (per SM)
Configurable 48 KB or 16 KB
L1 Cache (per SM)
Configurable 16KB or 48 KB
Double Precision Floating Point Capability
30 FMA ops/clock
256 FMA ops/clock
Single Precision Floating Point Capability
240 FMA ops/clock
512 FMA ops/clock
Special Function Units (per SM)
In this section, we present an analysis of (i) the impact of using shared memory, (ii) the impact of divergent branching, (iii) the speedups realized by our implementation, and (iv) the accuracy of our results. On CPU, the timing information was gathered by placing required time-checking calls around the computational kernel, excluding the I/O required for writing the results. On GPU, the execution time was measured by using the CUDAEventRecord function call. For a fair comparison, time for offloading the data onto the GPU global memory and storing the results back onto the CPU was taken into account along with the time for execution of the kernel. Single precision was used on both platforms. All the numbers presented are an average of 10 runs performed on each platform. For HCP, the 1st-level threshold was set to 10Å and the 2nd-level threshold was fixed at 70Å.
Percentage reduction in the number of global memory loads
H Helix myoglobin
nucleosome core paricle
From the table, we note that the global memory loads are reduced by 50% for all structures, when the HCP approximation is not used. Whereas with HCP, the amount by which loads are reduced varies from structure to structure. This can be reasoned as follows. When no approximation is applied, the coordinates of vertices and that of all atoms are accessed from global memory, which requires cycling through the residue groups. Therefore when shared memory is not used, the vertex coordinate is loaded twice, once for residue and once for the atom. While when shared memory is used, it is loaded only once, i.e., for copying into the shared memory, thereby, resulting in a 50% reduction in global memory loads.
But in the case of HCP, the number of times a vertex coordinate is loaded from global memory depends upon the structure. This is because for each structure the effective number of computations to be performed are different. For example, for a structure with 1st level of approximation and no shared memory usage, vertex coordinates would be loaded three times from the global memory - (i) to compute the distance to the complex, (ii) to compute the distance to the strand and (iii) finally to compute the distance to the residue. While with shared memory it would be accessed just once. Similarly, for a structure with no approximation, the vertex would be accessed four times, without using shared memory. Therefore, the table suggests that least number of components could be approximated for the virus capsid, and hence, maximum percentage reduction.
Use of the shared memory resulted in a drastic reduction in the number of global loads and hence, provided about 2.7-fold speedup to our application.
Impact of the HCP approximation on memory transactions and divergent branches
Decrease in # of mem. Transactions
Increase in # of divergent branches
H Helix myoglobin
nucleosome core particle
Figures 6 and 7 present speedups achieved by our implementation on NVIDIA Tesla C1060 and NVIDIA Fermi Tesla C2050 GPUs respectively. Both the figures present speedup over the CPU implementation optimized by hand-tuned SSE Intrinsics and parallelized across 16 cores, without the use of any approximation algorithm. Speedups achieved due to the use of GPU alone as well as that due to the combination of GPU and HCP are presented for all four structures.
From both these figures, we note that speedup due to the GPU alone is almost constant for all three structures barring Mb.Hhelix. This is because Mb.Hhelix is an extremely small structure and it does not requires enough GPU threads for the computation of its molecular surface potential, thereby, leaving the GPU under utilized. This phenomenon is prominent in case of the Fermi Tesla C2050 which actually results in a slowdown due to under-utilization of the GPU. For other structures the threshold of the number of threads is met and almost similar speedup is achieved across both the figures. The observed speedup is around 11-fold on Tesla C1060, whereas on Tesla C2050 the speedup is around 25-fold. The increased speedup on C2050 may be attributed to several architectural differences between Fermi and GT200 GPUs, like the ability for concurrent kernel execution, ECC support and fewer SMs. However, the architectural feature that we feel has the most impact for this algorithm, is the presence of a hierarchy of caches on Fermi, as they allow for greater exploitation of the locality of data. For no approximation, all atoms need to be accessed sequentially, thereby, making the caches play an important role and hence, Fermi Tesla C2050 is deemed to be more effective.
As explained in a previous section, application speedup due to the combination of GPU and HCP increases with the increase in number of memory transactions reduced per divergent branch increased. Therefore, from Table 4, number of memory transactions reduced is maximum for virus Capsid and hence, it attains the maximum speedup. Next highest reduction in the number of memory transactions is for 2eu1 and hence, next highest speedup and so on. Our application manages to achieve up to 1,860-fold speedup with HCP on Tesla C1060 for Capsid while the corresponding speedup on Fermi Tesla C2050 is approximately 1,600-fold. The actual execution time of our implementation on both GPUs is <1 sec.
Speedup achieved with HCP on Tesla C2050 is less than that achieved on Tesla C1060 due to the fact that the algorithm fails to take the advantage of the caches present on Fermi, as before. With HCP, not all memory requests are sequential as coordinates of both atoms and high level components are required, making the caches less potent than before. Speedups achieved across all figures for without_HCP version are almost consistent for both the GPUs, which is because it does not introduce divergent branches. Whereas, the version with HCP, results in divergent branches and also varying amounts of speedups across structures, depending upon how much cost of the divergent branches can be amortized by the corresponding reduction in memory transactions.
Relative RMS (root-mean-squared) error
H helix myoglobin
CPU with HCP
GPU with HCP
nuclesome core particle
CPU with HCP
GPU with HCP
chaperonin GroEL, 2eu1
CPU with HCP
GPU with HCP
CPU with HCP
GPU with HCP
Due to the paltry error introduced by single precision on the GPU, it may be deemed acceptable for the computation of molecular surface potential on the GPU but may be unsatisfactory for molecular dynamics. In case of molecular dynamics simulations, even a minute error in one time step can have a substantial effect on the results as the error would accumulate during the course of the simulation. It is here that superior double precision support of Fermi would come in handy.
With the emergence of GPU computing, there have been many attempts at accelerating the electrostatic surface potential (ESP) computations for biomolecules. In our work, we demonstrate the combined effect of using a multi-scale approximation algorithm called hierarchical charge partitioning (HCP) and mapping it onto a graphics processing unit (GPU). While mainstream molecular modeling algorithms impose an artificial partitioning of biomolecules into a grid/lattice to map it onto a GPU, HCP is significantly different in that it takes advantage of the natural partitioning in biomolecules, which facilitates a data-parallel mapping onto the GPU.
We then presented our methodology for mapping and optimizing the performance of HCP on the GPU when applied to the calculation of ESP. Despite being a memory-bound application, we leveraged many known optimization techniques to accelerate performance. In addition, we demonstrated the effectiveness of the introduction of divergent branching on GPUs when it reduces the number of instructional and memory transactions.
For a fairer comparison between the CPU and GPU, we optimized the CPU implementation by using hand-tuned SSE intrinsics to handle the SIMD nature of the application on the CPU. We then demonstrated a 1,860-fold reduction in the execution time of the application when compared to that of the hand-tuned SSE implementation on the 16 cores of the CPU. Furthermore, we ensured that the use of single-precision arithmetic on the GPU, combined with the HCP multi-scale approximation, did not significantly affect the accuracy of our results.
For future work, we will apply our HCP approximation algorithm to molecular dynamics (MD) simulations on the GPU, given how well it performs in the case of molecular modeling. For MD simulations, the use of double precision is mandatory as the error incurred in each time-step would accumulate over time, thereby immensely affecting the accuracy of the MD results. In addition, we plan to exploit the use of the cache hierarchy on the NVIDIA Fermi to accelerate the memory-bounded aspect of our application.
This work was supported in part by NSF grants CNS-0915861, CNS-0916719 and a NVIDIA Professor Partnership Award. We thank Tom Scogland for helping us with the initial implementation of GPU-GEM and are also grateful to Alexey Onufriev and his group for making us familiar with HCP approximation.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 5, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S5.