High-performance blob-based iterative three-dimensional reconstruction in electron tomography using multi-GPUs
© Wan et al; licensee BioMed Central Ltd. 2012
Published: 25 June 2012
Skip to main content
© Wan et al; licensee BioMed Central Ltd. 2012
Published: 25 June 2012
Three-dimensional (3D) reconstruction in electron tomography (ET) has emerged as a leading technique to elucidate the molecular structures of complex biological specimens. Blob-based iterative methods are advantageous reconstruction methods for 3D reconstruction in ET, but demand huge computational costs. Multiple graphic processing units (multi-GPUs) offer an affordable platform to meet these demands. However, a synchronous communication scheme between multi-GPUs leads to idle GPU time, and a weighted matrix involved in iterative methods cannot be loaded into GPUs especially for large images due to the limited available memory of GPUs.
In this paper we propose a multilevel parallel strategy combined with an asynchronous communication scheme and a blob-ELLR data structure to efficiently perform blob-based iterative reconstructions on multi-GPUs. The asynchronous communication scheme is used to minimize the idle GPU time so as to asynchronously overlap communications with computations. The blob-ELLR data structure only needs nearly 1/16 of the storage space in comparison with ELLPACK-R (ELLR) data structure and yields significant acceleration.
Experimental results indicate that the multilevel parallel scheme combined with the asynchronous communication scheme and the blob-ELLR data structure allows efficient implementations of 3D reconstruction in ET on multi-GPUs.
Electron tomography (ET) combines electron microscopy (EM) and tomographic imaging to elucidate three-dimensional (3D) descriptions of complex biological structures at molecular resolution . In ET, a series of projection images are taken with an electron microscope from a unique biological sample at different orientations around a single tilt axis . From those projection images, the 3D structure of the sample can be obtained by means of tomographic reconstruction algorithms . Weighted backprojection (WBP) is a standard reconstruction method in the field of 3D reconstruction in ET, due to its algorithmic simplicity and computational efficiency . The major disadvantage of WBP, however, is that the results may be strongly affected by limited-angle data and noisy conditions . Iterative methods, such as Simultaneous Iterative Reconstruction Technique (SIRT), are one of the main alternatives to WBP in 3D reconstruction in ET, owing to their good performance in handling incomplete, noisy data [6, 7]. Furthermore, blob-based iterative methods show a better performance than voxel-based ones in the incomplete and noisy conditions . However, they have not been extensively used due to their high computational cost . Furthermore, the need for high resolution makes ET of complex biological specimens use large projection images, which also yields large reconstructed files and requires an extensive use of computational resources and considerable processing time [8, 9].
3D reconstruction in ET demands huge computational costs and resources that derive from the computational complexity of the reconstruction algorithms and the size and number of the projection images involved. Traditionally, high-performance computing  has been used to address such computational requirements by means of parallel computing on supercomputers , large computer clusters  and multicore computers . Recently, graphics processing units (GPUs) offer an attractive alternative platform to cope with the demands in ET in terms of the high peak performance, cost effectiveness, and the availability of user-friendly programming environments, e.g. NVIDIA CUDA [12, 13]. Several advanced GPU acceleration frameworks have been proposed to allow 3D-ET reconstruction to be performed on the order of minutes [14, 15]. However, these parallel reconstructions on GPUs only adopt traditional voxel basis functions which are less robust than blob basis functions under noisy situations. Some previous work focuses on the blob-based iterative reconstruction on a single GPU, which is still time-consuming [16, 17]. Single GPU cannot meet the requirements of the computational resources and the memory storage of 3D-ET reconstructions with the size of the projection images increasing constantly (typically 2 k × 2 k or even 4 k × 4 k). The architectural notion of a CPU serviced by multi-GPUs is an efficient solution for parallel 3D-ET reconstruction due to increasing the power of computations and the storage of memory.
Achieving efficient blob-based iterative reconstructions on multi-GPUs is challenging: because of the overlapping nature of blobs, the use of blobs as basis functions needs the communication between multiple GPUs during the process of iterative reconstructions. CUDA provides a synchronous communication scheme to handle the communication between GPUs . But the downside of the synchronous communication is that each GPU must stop and sit idle while data is exchanged. The idle sit of GPU is a waste of resources which has a negative impact on the performance of reconstructions on multi-GPUs. Furthermore, as data collection strategies and electron detectors improve, a sparse weight matrix involved in blob-based iterative reconstruction methods needs large memory storage. Due to the limited available memory, it is infeasible to store such a large sparse matrix for most GPUs. Computing the weight matrix on the fly is more efficient than storing the matrix in the previous GPU-based ET implementations . But it could bring the redundant computations since the weighted matrix has to be computed twice at least in each iteration.
To address the problems discussed above, in this paper, we make the following contributions: first, we present a multilevel parallel strategy for blob-based iterative reconstructions in ET on multi-GPUs, which can significantly accelerate 3D reconstructions in ET. Second, we develop an asynchronous communication scheme on multi-GPUs to minimize idle GPU time by asynchronously overlapping communications with computations. Finally, a data structure named blob-ELLR adopting three symmetric optimization techniques is developed to significantly reduce the storage space of the weight matrix. It only needs nearly 1/16 of the storage space in comparison with ELLPACK-R (ELLR). Also, the blob-ELLR format can achieve optimal coalesced access to global memory, which is suitable for 3D-ET reconstructions on multi-GPUs. Furthermore, we implement all the above techniques on the two different platforms: a NVIDIA GeForce GTX295 and two NVIDIA Tesla C2050s respectively. Experimental results show that the parallel strategy greatly reduces memory requirements and exhibits a significant acceleration.
In ET, the projection images are acquired from a specimen through the so-called single-axis tilt geometry. The specimen is tilted over a range, typically from -60° (or -70°) to +60° (or +70°) due to physical limitations of microscopes, with small tilt increments (1° or 2°). An image of the same object area is recorded at each tilt angle and then the 3D reconstruction of the specimen can be obtained from a set of projection images by means of blob-based iterative methods. In this section, we give a brief overview of blob-based iterative reconstruction methods, describe an iterative method called SIRT, and present a GPU computational model.
where I m (·) denotes the modified Bessel function of the first kind of order m, a is the radius of the blob, α is a non-negative real number controlling the shape of the blob. The choice of the parameters m, a, and α will influence the quality of the blob-based reconstructions. The basis functions that developed in  are used for the choice of the parameters in our algorithm (i.e., a = 2, m = 2 and α = 3.6).
where rf ij is the projected value of the pixel x j at an angle θ i . W is defined as a sparse matrix with M rows and N columns where w ij is the element of W. In general, the storage demand of the weighted matrix W rapidly increases as the size and the number of projection images increase. For example, when the size of images is 2 k × 2 k, the storage demand of the weighted matrix approaches to 3.5 GB. It is hard to store such a large matrix in the most GPUs due to the limited memory of GPUs.
Under those assumptions, the image reconstruction problem can be modelled as the inverse problem of estimating the x j 's from the p i 's by solving the system of linear equations given by Eq. (3). This problem is usually resolved by means of iterative methods.
SIRT is a kind of all simultaneous iterative methods to solve the linear system which appears in image reconstruction. All simultaneous iterative methods (such as SIRT , component averaging methods (CAV) ) utilize the projection in the all directions to refine the current reconstruction in each iteration so that they are well suited for parallel computing on GPUs. In our work, we adopt SIRT to perform parallel reconstruction on multi-GPUs.
SIRT produces fairly smooth reconstruction results but requires for convergence a large number of iterations since SIRT adopts a global strategy: an approximation is updated simultaneously by all the projection images . SIRT updates each x j only once per iteration, which means its updating strategy is pixel-by-pixel.
Our algorithm is based on NVIDIA GPU architecture and compute unified device architecture (CUDA) programming model. GPU is a massively multi-threaded data-parallel architecture. NVIDIA GPUs contain a scalable array of streaming multiprocessors (SMs) each of which contains scalar processors (SPs). On the old Tesla architecture of GPUs, there are 8 SPs per SM while a SM contains 32 SPs in the new Fermi architecture GPUs. All the SPs in the same SM execute the same instructions synchronously in a Single Instruction Multiple Thread (SIMT) fashion . During execution, 32 threads from a continuous section are grouped into a warp, which is the scheduling unit on each SM.
NVIDIA provides the programming model and software environment of CUDA. CUDA is an extension to the C programming language. A CUDA program consists of a host program that runs on CPU and a kernel program that executes on GPU itself. The host program typically sets up data and transfers it to and from the GPU, while the kernel program processes that data. Kernel, as a program on GPUs, consists of thousands of threads. Threads have a three-level hierarchy: grid, block, thread. A grid is a set of blocks that execute a kernel, and each block consists of hundreds of threads. All threads within a block can share the same on-chip memory and can be synchronized at a barrier. Each block can only be assigned to and executed on one SM.
CUDA provides a synchronous communication scheme (i.e. cudaThreadSynchronize()) to handle the communication between GPUs. With the synchronous scheme, all of threads on GPUs must be blocked until the data communication has been completed. CUDA devices use several memory spaces including global, local, shared, texture, constant memory and registers. Of these different memory spaces, global memory is the most plentiful. Global memory loads and stores by a half warp (16 threads) are coalesced in as few as one transaction (or two transactions in the case of 128-bit words) when certain access requirements are met. Coalesced memory accesses deliver a much higher efficient bandwidth than non-coalesced accesses, thus greatly affecting the performance of bandwidth-bound programs.
The processing time of 3D reconstruction with blob-based iterative methods is a major challenge in ET due to large reconstructed data volume. So parallel computing on multi-GPUs is becoming paramount to cope with the computational requirement. We present a multilevel parallel strategy for blob-based iterative reconstruction and implement it on the OpenMP-CUDA architecture.
In the first level of the multilevel parallel scheme, a coarse-grained parallelization is straightforward in line with the properties of ET reconstruction. The single-tilt axis geometry allows data decomposition into slabs of slices orthogonal to the tilt axis. For this decomposition, the number of slabs equals to the number of GPUs, and each GPU reconstructs its own slab. Consequently, the 3D reconstruction problem can be decomposed into a set of 3D slabs reconstruction sub-problems. However, the slabs are interdependent due to the overlapping nature of blobs. Therefore, each GPU has to receive a slab which is composed of its corresponding own slices and additional redundant slices reconstructed in neighbour slabs. The number of redundant slices depends on the blob extension. In a slab, the own slices are reconstructed by the corresponding GPU and require information provided by the redundant slices from the neighbour GPUs. During the process of 3D-ET reconstruction, each GPU has to communicate with other GPUs for the additional redundant slices.
In the second level of the multilevel parallel scheme, 3D reconstruction of one slab, as a fine-grained parallelization, is implemented on each GPU using CUDA. In the 3D reconstruction of a slab, the generic iterative process is described as follows:
Initialization: compute the matrix W and make a initial value for X (0) by BPT;
Reprojection: estimate the computed projection data P' based on the current approximation X;
Backprojection: backproject the discrepancy ΔP between the experimental and calculated projections, and refine the current approximation X by incorporating the weighted backprojection ΔX.
As described above in the multilevel parallel scheme, there must be two communications between neighbour GPUs during one iterative reconstruction process. One is to exchange the computed projections of the redundant slices after the reprojection process. The other is to exchange the reconstructed pixels of the redundant slices after the backprojection process. CUDA provides a synchronous communication scheme (i.e. cudaThreadSynchronize()) to handle the communication between GPUs. With the synchronous communication scheme, GPUs must sit idle while data is exchange, which has a negative impact on the performance of the reconstruction in ET.
In the parallel blob-based iterative reconstruction, another problem is the lack of memory on GPUs for the sparse weighted matrix. Recently, several data structures have been developed to store sparse matrices. Compressed row storage (CRS) is the most extended format to store the sparse matrix on CPUs . ELLPACK can be considered as an approach to outperform CRS . Vazquez et al. proposed and evaluated a variant of the ELLPACK format called ELLPACK-R (ELLR) . ELLR has been proved to outperform the most efficient formats for storing the sparse matrix data structure on GPUs . ELLR consists of two arrays, A and I of dimension N × MaxEntriesbyRows, and an additional N-dimensional integer array called rl is included in order to store the actual number of nonzeroes in each row [13, 28]. With the size and number of projection images increasing, the memory demand of the sparse weighted matrix rapidly increases. The weighted matrix involved is too large to load into most of GPUs due to the limited available memory, even with the ELLR data structure.
Although the blob-ELLR without symmetric techniques can reduce the storage of the sparse matrix W, the number of (4B) × N is rather large especially when the number of N increases rapidly. The optimization takes advantage of the symmetry relationships as follows:
So, only w 1j is stored in the blob-ELLR, whereas the other elements are easily computed based on w 1j . This scheme can reduce the storage spaces of A and I to 25%.
It is easy to see that the point (-x,-y) of a slice is then projected to a point r1 (r1 = -r) in the same tilt angle θ. The weighted value of the point (-x,-y) can be computed according to that of the point (x, y). Therefore, it is not necessary to store the weighted value of almost a half of the points in the matrix W so that the space requirements for A and I are further reduced by nearly 50%.
In general, the tilt angles used in ET are halved by 0°. Under the condition, a point (-x, y) with a tilt angle -θ is projected to a point r2 (r2 = -r). Therefore, the projection coefficients are shared with the projection of the point (x, y) with the tilt angle θ. This further reduces the storage spaces of A and I by nearly 50% again.
With the three symmetric optimizations mentioned above, the size of the storage for two arrays in the blob-ELLR is almost (B/2) × (N/2) reducing to nearly 1/16 of original size.
In order to evaluate the performance of the multilevel parallel strategy, the blob-based iterative reconstructions of the caveolae from the porcine aorta endothelial (PAE) cell have been performed . Three different experimental datasets are used (denoted by small-sized, medium-sized, large-sized) with 56 images of 512 × 512 pixels, 112 images of 1024 × 1024 pixels, and 119 images of 2048 × 2048 pixels, to reconstruct tomograms of 512 × 512 × 190, 1024 × 1024 × 310 and 2048 × 2048 × 430 respectively. All the experiments are carried out on both GT200 and Fermi platforms. The details of the platforms are as follows. The GT200 machine consist of a 2.66 GHz Intel Xeon X5650, 24 GB RAM, and a NVIDIA GeForce GTX 295 card including two Tesla GPUs, each containing 30 SMs of 8 SPs (i.e. 240 SPs) at 1.2 GHz, 896 MB of memory. The Fermi machine is composed of the same CPU based on Intel Xeon X5650, and two NVIDIA Tesla C2050 cards. NVIDIA Tesla C2050 adopts the Fermi architecture and contains 14 SMs of 32 SPs (i.e. 448 SPs) at 1.15 GHz, 3 GB of memory. The two machines are both running on Redhat EL5 64-bit. For the comparison of the performance of multi-GPUs with CPU, we have performed the related serial program on the same CPU, i.e. Intel Xeon X5650, with a single core. To clearly evaluate the performance of the asynchronous communication scheme and the blob-ELLR data structure respectively, we have performed two sets of experiments. The details of the experiments are introduced below.
Running times (s)
512 × 512
1024 × 1024
2048 × 2048
512 × 512
1024 × 1024
2048 × 2048
ET allows elucidation of the molecular architecture of complex biological specimens. Blob-based iterative methods yield better results than other methods, but are not used extensively in ET because of their huge computational demands. Multi-GPUs have emerged as powerful platforms to cope with the computational requirements, but have the difficulties due to the synchronous communication and limited memory of GPUs. In this work, we present a multilevel parallel strategy combined with an asynchronous communication scheme and a blob-ELLR data structure to perform high-performance blob-based iterative reconstruction in ET on multi-GPUs. The asynchronous communication scheme is used to minimize the idle GPU time. The blob-ELLR structure only needs nearly 1/16 of the storage space in comparison with the ELLR storage structure and yields significant acceleration compared to the standard and ELLR matrix methods. In this work, adopting the multilevel parallel strategy with the asynchronous communication scheme and the blob-ELLR data structure, we have performed the parallel 3D-ET reconstruction using SIRT on multi-GPUs. In fact, the parallel strategy proposed can be also easily applied to the other simultaneous methods, e.g. CAV. In the future work, we will further investigate and implement the multilevel parallel strategy and the asynchronous communication scheme on a many-GPU cluster.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 10, 2012: "Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)". The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S10.
We would like to thank Prof. Fei Sun and Dr. Ka Zhang in Institute of biophysics for providing the experimental datasets. Work is supported by grants National Natural Science Foundation for China (61103139, 61070129 and 61003164); National Grand Fundamental Research 973 Program of China (2011CB302501) and National Core-High Tech-basic Program (2011ZX01028-001-002).