GPU-accelerated iterative reconstruction for limited-data tomography in CBCT systems

Background Standard cone-beam computed tomography (CBCT) involves the acquisition of at least 360 projections rotating through 360 degrees. Nevertheless, there are cases in which only a few projections can be taken in a limited angular span, such as during surgery, where rotation of the source-detector pair is limited to less than 180 degrees. Reconstruction of limited data with the conventional method proposed by Feldkamp, Davis and Kress (FDK) results in severe artifacts. Iterative methods may compensate for the lack of data by including additional prior information, although they imply a high computational burden and memory consumption. Results We present an accelerated implementation of an iterative method for CBCT following the Split Bregman formulation, which reduces computational time through GPU-accelerated kernels. The implementation enables the reconstruction of large volumes (>10243 pixels) using partitioning strategies in forward- and back-projection operations. We evaluated the algorithm on small-animal data for different scenarios with different numbers of projections, angular span, and projection size. Reconstruction time varied linearly with the number of projections and quadratically with projection size but remained almost unchanged with angular span. Forward- and back-projection operations represent 60% of the total computational burden. Conclusion Efficient implementation using parallel processing and large-memory management strategies together with GPU kernels enables the use of advanced reconstruction approaches which are needed in limited-data scenarios. Our GPU implementation showed a significant time reduction (up to 48 ×) compared to a CPU-only implementation, resulting in a total reconstruction time from several hours to few minutes.


Background
The source-detector pair in conventional cone beam computed tomography (CBCT) systems rotates around the patient through 360 degrees (full angular span) to acquire at least 360 projections. However, there are cases in which the number of projections acquired is smaller and/or covers a smaller angular span (down to 150 degrees) owing to movement limitations, as occurs during surgery, or in respiratory-gated CT, where only a few projections correspond to each gate. The reconstruction of these limited data with the conventional method proposed by Feldkamp, Davis and Kress (FDK) results in severe artifacts in the image (streaks and/or edge distortion), making it advisable to use advanced reconstruction methods that compensate for the lack of data by including prior information about the sample. The most common option for prior information is the assumption of local smoothness, which can be imposed by adding the minimization of the L 1 norm of the total variation (TV) term. Since the TV term is not differentiable, the use of traditional reconstruction methods may be subject to instability problems [1]. In [2], the authors showed that reconstructing limited data in CT can be effi-ciently solved by means of the Split Bregman formulation, which reduces the optimization problem to a sequence of unconstrained and simpler problems that are updated iteratively.
In a previous work [3], we presented a new reconstruction method based on the Split Bregman formulation. We reported significant image improvement in terms of artifact reduction using this approach for limited-data CBCT, as compared with FDK. We have presented two implementations of this algorithm, one combining MATLAB and CUDA [3] and another one based on a CPU distributed version [4]. The main limitation of both solutions is that only 2D images can be reconstructed owing to computational and memory requirements. Reconstruction of 3D images with these methods was not possible for two main reasons: (1) memory requirements of the algorithm, and (2) long execution times which hinder the reconstruction of standard size volumes in a reasonable amount of time.
Another example of using MATLAB and CUDA is the work by Smith et al. [5], an iterative reconstruction method based on Split Bregman for MRI. The main limitation of this work is that the communication between MATLAB and GPUs is done through an intermediate library, which increases the overhead with respect to programming in native languages. Furthermore, MRI reconstruction uses FFT (Fast Fourier Transform), which is computationally less expensive than the projection and backprojection kernels needed in CT reconstruction.
Other works presented CPU-GPU implementations using native languages for FDK [6][7][8], a reconstruction method less challenging than iterative reconstruction. Regarding iterative reconstruction algorithms, which include several projection and backprojection operations, techniques employed for parallelization highly affect the reconstruction execution time as shown in [9], obtaining a speedup factor between 50× and 200× using two GPUs with respect to the execution of the same algorithm in a single-thread CPU. Hu et al. [10] proposed an advanced multi-resolution approach to reduce the total execution time. Nevertheless, this work was applied to full span data with a high number of projections. Focusing on the problem of limited data, Jia et al. [11] proposed a new iterative method but they did not address the problem of handling large volumes. A more recent work by Matenine et al. [12] presented a solution for reduced number of projections, but the authors commented the limitation by the memory capacity of the GPUs. Nevertheless, none of these works addressed the problem of limited angular span.
In this work, we present an accelerated implementation for limited data both in angular span and number of projections, that uses the GPU for the most time-consuming operations. Our solution includes a partitioning strategy to be able to handle large volumes with a total footprint of several GB.

Algorithm
The reconstruction problem follows the TV minimization [13]: where ∇(u) 1 corresponds to the L 1 norm of the gradient of the reconstructed image u, A is the system matrix, f is the acquisition data, σ 2 is the image noise, and is the subspace corresponding to the field of view (FOV).
Using the Split Bregman formulation [1], the L 1constrained optimization problem shown in Eq. (1) can be converted into the following unconstrained problems, which are solved at each iteration k: (2) where μ and λ are regularization parameters. Equation (2) can be split into two sub-problems. The first sub-problem contains only differentiable L 2 -norm terms. By differentiating with respect to u and setting the result to 0, we obtain the following problem: which can be summarized in the following problem: which is solved iteratively using a Krylov space solver, namely, the biconjugate gradient stabilized method. In this step, an input parameter β controls the stability of the problem. The second sub-problem contains L 1 terms that are not differentiable. Therefore, it is tackled using analytical formulas (shrinkage operation), which need two additional input parameters α and λ. Finally, Eqs. (3,4,5) are the Bregman iterations that impose constraints for acquired data and total variation, respectively.
The Krylov space solver is implemented with the biconjugate gradient stabilized method, BiCGStab [15], where the input matrix in Eqs. (6,7) is substituted by the Algorithm 2.

Results
The method was evaluated in a computer with two Intel(R) Xeon(R) E5-2630 v3 processors at 2.40 GHz and one NVidia Tesla K40c GPU. Limited-data acquisitions (DimProj × DimProj × NumProjs pixels) were simulated from a previously acquired small-animal scan (512×512× 512 pixels; 0.125 mm pixel size), as shown in Fig. 3, left. We studied the following parameters: dependency on the number of projections with NumProj = 45, 60, 90, and 120 covering an angular span of 360 degrees and DimProj = 512; dependency on angular span for NumProj = 45 uniformly distributed in an angular span of 45, 60, 90, 135, 150, 180, and 270 degrees (DimProj = 512); and the effects of the projection size, by considering DimProj = 256, 512, and 1024 when 90 projections are obtained uniformly distributed in an angular span of 360 degrees. All simulations were generated using FUX-SIM [16], a simulation/reconstruction framework for X-ray systems.
These data were reconstructed with an FDK-based method [17] and the proposed iterative method resulting in a volume of DimProj × DimProj × DimProj pixels. For the latter, we used α = 0.003, μ = 20, β = 3, and λ = 2 as reconstruction parameters (see [2] for details on how to select these parameters). The number of iterations (iterations in Algorithm 1, line 4) was 35, selected high enough to ensure an error variation smaller than 1%. Figure 3 shows the reference image (FDK reconstruction of the complete dataset) and the results of FDK and the proposed iterative method for two limited-data configurations. Image quality was assessed with two metrics. To evaluate the global image quality, we calculated the root mean square error (RMSE) between the reference image and the intermediate solution u k from the limited dataset. To evaluate the influence of streaks and noise in the reconstructed image, we measured the improvement of signal to noise ratio (SNR) obtained with the iterative method with respect to the FDK-based method in the homogeneous area indicated in Fig. 3. Table 1 shows both metrics for different number of projections and angular span, with a noticeable improvement when increasing the angular span despite the low number of projections. Figure 4 plots the dependence of the RMSE with the number of iterations, k, for six different limited-data cases varying the angular span and the number of projections. We can see that the proposed iterative method shows a similar behaviour for all limited-data configurations. Figures 5, 6, and 7 show the breakdown of the reconstruction time of each configuration, obtained as the average of three consecutive executions in order to avoid time variability due to operating system operations. Reconstruction time is divided into backprojection, forward projection, and time spent in other operations including I/O operations and CPU computation.
Finally, we compared our implementation in GPU of the iterative method with a CPU-only implementation of the same iterative method parallelized using OpenMP to fully exploit multi-core architectures. Figures 8 and 9 plot the time spent in the first iteration (average of three different executions) reaching a speedup factor of 48× with the GPU implementation with respect with the CPU-only one.

Discussion and conclusions
We present an accelerated implementation of a method for 3D limited-data tomography solved in an efficient way by using a GPU for the most time-consuming operations.
Our evaluation of the method showed a high reduction of the severe artifacts present when using the conventional FDK-based method for cases with low number of projections, with an SNR improvement better than 20 dB for all cases. The image distortion due to the limited angular span was also reduced with the proposed method.
To evaluate the performance of the implementation according to data size, we fixed a high number of iterations (iterations = 35) for all experiments in order to ensure optimum image quality for the worst conditions.  Reconstructions of large studies (volume of 1024 × 1024 × 1024 pixels) are feasible with this accelerated implementation of the iterative method thanks to the partitioning strategy followed for both backprojection and forward projection operations. Our GPU implementation showed significant time reduction (up to 48×) compared with a CPU-only implementation, resulting in a decrease of the total reconstruction time from several hours to few minutes. A fair comparison with other iterative reconstruction implementations proposed in the literature is not feasible owing to differences in the specific algorithms and the hardware used. Nevertheless, we note that the work by Matenine et al. [12], which is the most similar to our solution, was limited by the memory capacity of the GPUs and did not address the problem of limited angular span. In contrast, our GPU accelerated algorithm obtains similar results in terms of execution time despite the fact that it works with large detector and reconstructed volume sizes with a low number of projections in a limited angular span, which increase significantly the number of iterations needed for convergence. Regarding our previous implementations of the same algorithm, the implementation we propose substantially reduces reconstruction time and hardware resources. As previously reported [3], a solution combining MATLAB and CUDA kernels required a large amount of memory transfers between the CPU and the GPU, resulting in increased execution times, which is unfeasible for the large 3D volumes in real scenarios. This problem is partially solved here due to the use of native code and explicit memory transfers. On the other hand, the complete CPU-based implementation presented in [4] required a high volume of distributed resources to obtain acceptable execution times. For example, using 12 compute nodes resulted in more than 1,000 seconds for a volume of 512 × 512 × 512 pixels for only 2 iterations of the algorithm, which is insufficient to obtain a high-quality image.
Efficient implementation using parallel processing and large-memory management strategies together with GPU