Volume 15 Supplement 8
Selected articles from the Third IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013): Bioinformatics
Multicore and GPU algorithms for Nussinov RNA folding
Junjie Li^{1}, Sanjay Ranka^{1} and Sartaj Sahni^{1}
DOI: 10.1186/1471-2105-15-S8-S1
© Li et al.; licensee BioMed Central Ltd. 2014
Published: 14 July 2014
Abstract
Background
One segment of an RNA sequence may pair with another segment of the same sequence through hydrogen bonds. The resulting two-dimensional structure is called the RNA sequence's secondary structure. Several algorithms have been proposed to predict an RNA sequence's secondary structure; these are referred to as RNA folding algorithms.
Results
We develop cache efficient, multicore, and GPU algorithms for RNA folding using Nussinov's algorithm.
Conclusions
Our cache-efficient algorithm provides a speedup between 1.6 and 3.0 relative to a naive single-core code. The multicore version of the cache-efficient algorithm provides a speedup, relative to the naive single-core algorithm, between 7.5 and 14.0 on a 6-core hyperthreaded CPU. Our GPU algorithm for the NVIDIA C2050 is up to 1582 times as fast as the naive single-core algorithm and between 5.1 and 11.2 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding.
Keywords
Nussinov, Multicore, CUDA, GPU, RNA
Background
An RNA sequence is a chain of nucleotides over the alphabet {G (guanine), A (adenine), U (uracil), C (cytosine)}. One segment of an RNA sequence may pair with another segment of the same sequence through hydrogen bonds; the resulting two-dimensional structure is called the RNA sequence's secondary structure. Two nucleotides in an RNA sequence can form the Watson-Crick base pairs A-U and G-C as well as the G-U wobble pair. Several algorithms have been proposed to predict an RNA sequence's secondary structure; these are referred to as RNA folding algorithms. Waterman and Smith [1] and Nussinov et al. [2] made the first attempts in 1978. Zuker et al. [3] refined Nussinov's algorithm by using a thermodynamic energy minimization model, which produces more accurate results at the expense of greater computational complexity. Although our focus in this paper is the simpler Nussinov algorithm, our strategies may be applied to Zuker's algorithm as well.
The complexity of Nussinov's and Zuker's algorithms is O(n^{3}), where n is the length of the RNA sequence to be folded. Other RNA folding algorithms with better prediction properties and higher complexity also exist. When folding viral sequences, n ranges from several thousand to several million, and single-core run time becomes excessive, so much effort has gone into developing parallel versions of various RNA folding algorithms. For example, [4, 5] develop a multicore code for an O(n^{4}) folding algorithm, while [6] does this for Nussinov's algorithm. [7] develops a framework for RNA folding algorithms on a cluster and tests this framework using an O(n^{6}) (PknotsRE) and an O(n^{4}) (PknotsRG) algorithm for the prediction of RNA secondary structure. FPGA solutions for secondary structure prediction have been proposed in [8-10] and GPU solutions in [11, 12]. We note that [11] is based on the algorithm of Zuker et al. [3] while [12] is based on that of Nussinov et al. [2].
We start in this paper by describing the GPU architecture and programming model. Then we state Nussinov et al.'s [2] dynamic programming recurrence for secondary structure prediction and we give modifications of these equations as obtained by Chang et al. [12]. These modifications simplify the parallelization of the original equations and compute the same results. We also describe the strategy used in [12] to obtain a GPU algorithm to solve the modified equations. A naive implementation of the modified equations of [12] together with a cache efficient version and multicore versions are given. We note that although [6] gives a multicore version of Nussinov's algorithm, our multicore version is much simpler and provides similar speedup. Then our GPU algorithm for the modified equations is described. Experimental and benchmark results are presented after that.
GPU architecture and programming model
Nussinov's dynamic programming recurrence
C(i, j), the score of the best structure for the subsequence a_{i}...a_{j}, is the maximum over the following four cases:
1. Add unpaired a_{i} to the best structure for subsequence a_{i+1}...a_{j}, as shown in Figure 2(a).
2. Add unpaired a_{j} to the best structure for subsequence a_{i}...a_{j−1}, as shown in Figure 2(b).
3. Add the (a_{i}, a_{j}) pair to the best structure for subsequence a_{i+1}...a_{j−1}, as shown in Figure 2(c).
4. Combine the two best structures for a_{i}...a_{k} and a_{k+1}...a_{j}, i ≤ k < j, as shown in Figure 2(d).
Once the C(i, j), 1 ≤ i < j ≤ n, have been computed, a traceback procedure can be used to construct the actual secondary structure, which is represented in the matrix as a path leading to the maximum score. While this traceback procedure takes O(n) time, the actual computation of the C matrix takes O(n^{3}) time.
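The four cases above give an O(n^{3}) dynamic program directly. As an illustrative sketch (not our benchmarked code), the fill phase can be written in Python, with a hypothetical bond function scoring 1 for A-U, G-C, and G-U pairs and no minimum loop length:

```python
# Sketch of the Nussinov fill. bond() is a hypothetical unit score:
# 1 for an A-U, G-C, or G-U pair, 0 otherwise.
def bond(a, b):
    return 1 if {a, b} in ({'A', 'U'}, {'G', 'C'}, {'G', 'U'}) else 0

def nussinov(seq):
    n = len(seq)
    C = [[0] * n for _ in range(n)]       # C[i][j] for i <= j; rest stays 0
    for span in range(1, n):              # span = j - i
        for i in range(n - span):
            j = i + span
            best = max(C[i + 1][j],       # case 1: i unpaired
                       C[i][j - 1],       # case 2: j unpaired
                       C[i + 1][j - 1] + bond(seq[i], seq[j]))  # case 3: pair (i, j)
            for k in range(i, j):         # case 4: combine two substructures
                best = max(best, C[i][k] + C[k + 1][j])
            C[i][j] = best
    return C[0][n - 1]
```

Note that case 4 with k = i subsumes case 1 and with k = j − 1 subsumes case 2; this observation underlies the simplified recurrence discussed next.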
Simplified recurrence and GPU algorithm
In [12], the entire computation is divided into three stages, as shown in Figure 3(b): the initial stage, the middle stage, and the final stage. In the initial stage, since the computation at each element is shallow (the distance to the central diagonal is short), one GPU thread is assigned to compute one element. No data exchange between threads is needed, and all threads synchronize before moving to the next diagonal. In the middle stage, an entire block of threads is assigned to compute one element, and the parallel reduction method from the CUDA SDK is used. In the final stage, all SMs collaborate to compute one element, because the distance from the element to the central diagonal is long and the computation for each element is heavy. Again, parallel reduction is used in this stage. To reduce accesses to device memory, the GPU algorithm of [12] stores each C(i, j) value, i < j, in position (i, j) as well as in the otherwise unused position (j, i). When C(j, k + 1) is to be read from device memory (i.e., when it is needed on the right side of Equation 3), the read is done from position (k + 1, j). This changes column reads into row reads and better utilizes the L2 and L1 caches of the target GPU.
Methods
Cache efficient and multicore algorithms
CPU 1 (Algorithm 1) is a naive single-core algorithm that computes C using the simplified recurrence of Chang et al. This algorithm computes M [i][j] = C(i + 1, j + 1), 0 ≤ i ≤ j < n, where n is the length of the RNA sequence R.
Algorithm 1 CPU 1
Require: RNA sequence R
Ensure: Array M such that M [i][j] = C(i + 1, j + 1)
1: for i ← 0 to |R| − 1 do
2: M [i][i] ← 0
3: end for
4: for i ← 0 to |R| − 2 do
5: M [i][i + 1] ← 0
6: end for
7: for diag ← 2 to |R| − 1 do
8: for row ← 0 to |R| − diag − 1 do
9: col ← diag + row
10: a ← R[row]; b ← R[col]
11: max ← M [row + 1][col − 1] + bond(a, b)
12: for k ← row to col − 1 do
13: t ← M [row][k] + M [k + 1][col]
14: max ← max{max, t}
15: end for
16: M [row][col] ← max
17: end for
18: end for
By using the lower triangle of M to store the transpose of the computed values in the upper triangle, as is done in the GPU implementation of [12], we get a cache-efficient version of CPU 1. To arrive at CPU 2, we change Line 13 of Algorithm 1 to "t ← M [row][k] + M [col][k + 1]" and Line 16 to "M [row][col] ← M [col][row] ← max". For row ≤ k < col, the values M [k + 1][col] lie in the same column, whereas the values M [col][k + 1] lie in the same row; reading along a row is more cache efficient than reading down a column.
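The CPU 2 modification can be sketched in Python as follows (illustrative only; bond is a caller-supplied scoring function standing in for the bond values of Equation 3):

```python
# Sketch of CPU 2: the lower triangle of M mirrors the upper triangle, so
# the inner-loop read M[k + 1][col] (a walk down a column) becomes
# M[col][k + 1] (a walk along a row). bond is a caller-supplied scorer.
def fold_mirrored(R, bond):
    n = len(R)
    M = [[0] * n for _ in range(n)]       # first two diagonals stay 0
    for diag in range(2, n):
        for row in range(n - diag):
            col = diag + row
            best = M[row + 1][col - 1] + bond(R[row], R[col])
            for k in range(row, col):
                best = max(best, M[row][k] + M[col][k + 1])  # row read
            M[row][col] = M[col][row] = best  # store value and its mirror
    return M
```

In pure Python the caching effect is masked by interpreter overhead; the access-pattern change matters in the compiled C code.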
Multicore versions of CPU 1 and CPU 2, respectively labeled OMP 1 and OMP 2, are obtained by inserting OpenMP statements to parallelize the for loops of lines 1, 4, and 8.
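In Python terms, the decomposition behind OMP 1 and OMP 2 looks like the sketch below: the diagonal loop stays sequential while the rows of each diagonal are farmed out to a worker pool. (Illustrative only; CPython threads will not speed up pure-Python arithmetic the way OpenMP speeds up the C loops.)

```python
# The OMP 1/OMP 2 decomposition: diagonals are processed in order, but the
# rows on one diagonal are independent and handled by a worker pool.
from concurrent.futures import ThreadPoolExecutor

def fold_parallel(R, bond, workers=4):
    n = len(R)
    M = [[0] * n for _ in range(n)]

    def do_row(diag, row):
        col = diag + row
        best = M[row + 1][col - 1] + bond(R[row], R[col])
        for k in range(row, col):
            best = max(best, M[row][k] + M[k + 1][col])
        M[row][col] = best

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for diag in range(2, n):          # diagonals remain sequential
            futures = [pool.submit(do_row, diag, r) for r in range(n - diag)]
            for f in futures:
                f.result()                # barrier: finish this diagonal first
    return M
```

The per-diagonal barrier mirrors the implicit barrier at the end of an OpenMP parallel for: every M value a row needs lies on an earlier diagonal.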
Our GPU algorithm
In our block strategy, we partition the upper triangle of the C matrix into square blocks, except that the partitions adjacent to the main diagonal are triangles and those at the right end may be non-square rectangles. Figure 4 shows the partitioning for the case n = 20 using 4 × 4 blocks. Notice the triangles adjacent to the main diagonal and the 4 × 1 non-square partitions at the right end. The blocks (whether triangular, square, or non-square) are indexed left-to-right, top-to-bottom beginning with (0, 0). In keeping with the traditional way to number blocks for GPU computation, the first coordinate increases as you move to the right and the second as you move down. So "X" (Figure 4) resides in block (4, 1), "K" in (4, 4), and "P" in (3, 2). Blocks on the main diagonal are indexed (i, i) and are triangular. From the dependencies in Equation 3, it follows that blocks lying on the same diagonal of blocks (i.e., blocks with index (i, i − k) for some fixed k) are independent and so may be computed in parallel, but the elements within a block must be computed by diagonals. Our 20 × 20 example of Figure 4 has 6 diagonals of blocks, so six iterations of computation are required, with each iteration computing all blocks on one diagonal in parallel.
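The block wavefront can be sketched as follows (illustrative Python; blocks are named (x, y) with x the horizontal and y the vertical coordinate as above, so x − y is constant along a diagonal of blocks):

```python
# Enumerate independent block sets in wavefront order: blocks (x, y) with
# x - y = d form one diagonal of blocks and may be computed in parallel.
def block_diagonals(nb):
    """Yield, for d = 0..nb-1, the list of blocks on block diagonal d
    of an nb-by-nb upper-triangular block grid."""
    for d in range(nb):
        yield [(y + d, y) for y in range(nb - d)]
```

Each yielded list is one kernel launch's worth of work: all blocks in the list run concurrently, and the next list starts only after the previous one finishes.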
As noted earlier, the first diagonal of blocks is comprised of triangles. To each triangular block, we assign a thread block. The threads within the assigned thread block compute the elements in the triangular block in diagonal order with all elements on the same diagonal being computable in parallel. Hence, for this computation, the number of thread blocks equals the number of triangular blocks.
Let us turn our attention now to the remaining blocks (i.e., the nontriangular blocks). Notice that when we start the computation of the elements in, say, block (4, 1), where "X" resides, "a" to "j", and "C" to "L" have already been computed, because they are on preceding block diagonals. But "k", "l", "A", and "B" have yet to be computed. The computation of the maximum of "c+C" to "j+J" can be done using a kernel maxKernel (described later). This kernel uses registers for temporary values and writes these temporary values to shared memory upon completion. The final value for "O" can be obtained by comparing the temporary maximum value in "O" with "P" plus the bond value in Equation 3. Then the maximum of "r+O", "q" plus its bond value, and the temporary maximum value in "m" is written to "m" as its final value. Similarly, for "M", the maximum of "O+R", "Q" plus its bond value, and the temporary maximum value in "M" is written to "M" as its final value. The computations for "m" and "M" can be done in parallel. So the computation within element block (4, 1) is done in diagonal order. All elements on the same diagonal can be computed in parallel with all data residing in shared memory. The pseudocode is shown as Algorithm 2.
Algorithm 2 Our GPU algorithm
Require: RNA sequence R, blocked diagonal index D, block size BS
Ensure: C matrix
1: Register[16] reg
2: Shared_Memory[BS][BS] sA
3: Shared_Memory[BS + 1][BS + 1] sC
4: Global_Memory[BS][BS] gB
5: Global_Memory[BS][BS] gC
6: tx ← threadIdx.x; ty ← threadIdx.y
7: bx ← blockIdx.x; by ← blockIdx.y
8: aRow ← by × BS; aCol ← aRow − 1
9: bRow ← aRow; bCol ← D × BS − 1 + aRow
10: for blk ← 1 to D − 1 do
11: sA ← the block starting at (aRow, aCol + blk × BS)
12: gB ← the block starting at (bRow + blk × BS, bCol)
13: maxKernel(sA, gB, reg)
14: Syncthreads
15: end for
16: sC ← reg
17: for d ← 1 to BS × 2 − 1 do
18: for Each element e on diagonal d do
19: Finish remaining computation
20: end for
21: Syncthreads
22: end for
23: gC ← sC
Algorithm 3 maxKernel
Require: Block sA in shared memory, block gB in global memory
Ensure: Partial result of reduction in reg
r0 ← gB[0][tx]; r1 ← gB[1][tx]
r2 ← gB[2][tx]; r3 ← gB[3][tx]
for j ← 0 to 6 do
for k ← 0 to 15 do
reg[k] ← max{reg[k], r0 + sA[ty × 16 + k][j × 4]}
end for
r0 ← gB[(j + 1) × 4][tx]
// 2 similar for loops for r1 and r2 come here
for k ← 0 to 15 do
reg[k] ← max{reg[k], r3 + sA[ty × 16 + k][j × 4 + 3]}
end for
r3 ← gB[(j + 1) × 4 + 3][tx]
end for
for k ← 0 to 15 do
reg[k] ← max{reg[k], r0 + sA[ty × 16 + k][28]}
end for
// 2 similar for loops for r1 and r2 come here
for k ← 0 to 15 do
reg[k] ← max{reg[k], r3 + sA[ty × 16 + k][31]}
end for
Description of maxKernel
We note that [11] also employs a maxKernel, but their kernel differs from ours.
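Stripped of its CUDA tiling (the 16 registers per thread and the staggered r0...r3 loads), the core of Algorithm 3 is a (max, +) block product: each output entry is the maximum over c of sA[row][c] + gB[c][col]. A plain Python rendering of that core is:

```python
# Core of maxKernel without the CUDA tiling: accumulate a (max, +) product
# of block A with block B into out, i.e.
# out[i][t] = max(out[i][t], max over c of A[i][c] + B[c][t]).
def maxplus_accumulate(A, B, out):
    rows, inner, cols = len(A), len(B), len(B[0])
    for i in range(rows):
        for t in range(cols):
            for c in range(inner):
                out[i][t] = max(out[i][t], A[i][c] + B[c][t])
    return out
```

The register tiling in Algorithm 3 reorders exactly these three loops so that each gB value is loaded once and reused against 16 sA rows.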
Results
We benchmarked our algorithms on a PC with a hyperthreaded 6-core Intel i7-980X 3.33 GHz CPU, 12 GB DDR3 RAM, and an NVIDIA Tesla C2050 GPU. Since only two threads may be scheduled per i7 core at any time, a maximum of 12 threads may be gainfully used. We used randomly generated RNA sequences in our experiments. Because the entire computation fills out an n × n matrix regardless of the input, the run time of our codes is relatively insensitive to the actual RNA sequence, so our use of random sequences does not materially affect our conclusions.
Single and multicore algorithms
Running time (seconds) of different CPU algorithms
n       CPU 1     OMP 1    Ratio    CPU 2    OMP 2    Ratio
3000      35.9      7.1      5.1     22.3      4.8      4.6
4000      98.1     18.6      5.3     52.8     11.3      4.7
5000     208.1     41.6      5.0    102.9     22.2      4.6
6000     363.7     72.2      5.0    177.5     45.3      3.9
7000     646.1    125.2      5.2    281.3     61.0      4.6
8000     924.4    197.8      4.7    419.6     92.5      4.5
9000    1461.5    291.0      5.0    596.6    129.9      4.6
10000   1927.7    395.0      4.9    819.1    176.9      4.6
11000   2800.8    559.2      5.0   1088.4    234.5      4.6
12000   3525.2    741.4      4.8   1413.6    303.3      4.7
13000   4852.3    978.8      5.0   1795.4    388.4      4.6
14000   6008.9   1250.2      4.8   2243.5    485.2      4.6
15000   7930.0   1641.4      4.8   2757.3    594.0      4.6
16000  10120.0   2380.8      4.3   3343.5    725.4      4.6
GPU algorithms
Running time (seconds) of different GPU algorithms
n       Ours 1   Ours 2    [12]   OursR   Ratio 1   Ratio 2
2000       0.1      0.1     0.3     0.1       3.0       1.0
6000       0.6      0.7     4.0     0.8       6.7       1.3
10000      1.9      2.2    16.4     3.2       8.6       1.7
14000      4.5      5.1    43.0     7.9       9.6       1.8
18000      8.8      9.9    89.5    16.0      10.2       1.8
22000     15.1     16.9   161.7    28.2      10.7       1.9
26000     23.9     26.7   266.3    45.8      11.1       1.9
37000        −     71.5       −       −         −         −
Single core vs multicore vs GPU
Speedup of Ours1 relative to other versions
n      CPU 2   OMP 2      n      CPU 2   OMP 2
3000   157.0    33.8     10000   424.4    91.7
4000   224.7    48.1     11000   441.4    95.1
5000   259.8    56.1     12000   465.9   100.0
6000   302.9    77.3     13000   472.3   102.2
7000   341.8    74.1     14000   496.9   107.5
8000   376.0    82.9     15000   503.3   108.4
9000   392.2    85.4     16000   522.6   113.4
Conclusions
We have developed simple and efficient single- and multicore algorithms as well as an efficient GPU code for RNA folding based on Nussinov's equations [2]. Our cache-efficient single-core algorithm provides a speedup between 1.6 and 3.0 relative to a naive single-core code. The multicore version of the cache-efficient algorithm provides a speedup, relative to the naive single-core algorithm, between 7.5 and 14.0 on a 6-core hyperthreaded CPU. Our GPU algorithm, Ours 1, for the NVIDIA C2050 is up to 1582 times as fast as the naive single-core algorithm and between 3.0 and 11.1 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding. With the 3 GB device memory available on an NVIDIA GPU, Ours 1 can handle sequences of length up to 26000. Sequences of length between 26000 and 37000 may be handled using Ours 2, which uses a one-dimensional array mapping of the upper triangle of the output matrix rather than a two-dimensional array representing the full output matrix. Ours 2, however, runs slightly slower than Ours 1. Our methods can be used to speed up RNA folding using Zuker's equations as well [3, 11].
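The one-dimensional mapping used by Ours 2 can be any packed layout of the upper triangle; a standard row-major packing (our illustration here, not necessarily the exact mapping used by Ours 2) is:

```python
# Row-major packing of the upper triangle (diagonal included) of an
# n x n matrix into a 1-D array: rows 0..i-1 hold n, n-1, ..., n-i+1
# elements, so row i starts at offset i*n - i*(i-1)/2.
def tri_index(i, j, n):
    assert 0 <= i <= j < n
    return i * n - i * (i - 1) // 2 + (j - i)
```

Such a packing needs n(n + 1)/2 entries instead of n^{2}, which is what lets a one-dimensional layout reach longer sequences within the same device memory.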
List of abbreviations used
RNA: RiboNucleic Acid
GPU: Graphics Processing Unit
PCI-Express: Peripheral Component Interconnect Express
CUDA: Compute Unified Device Architecture
GCUPS: Billion Cell Updates per Second
SM: Streaming Multiprocessor
DRAM: Dynamic Random-Access Memory
TFLOPS: Trillion Floating Point Operations Per Second
GFLOPS: Billion Floating Point Operations Per Second
I/O: Input/Output
CPU: Central Processing Unit
Declarations
Publication of this supplement was funded, in part, by the National Science Foundation under grants CNS0963812, CNS1115184, CNS0905308, and the National Institutes of Health under grant R01LM010101.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 8, 2014: Selected articles from the Third IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S8.
References
1. Waterman MS, Smith TF: RNA secondary structure: a complete mathematical analysis. Math Biosci. 1978, 42: 257-266. 10.1016/0025-5564(78)90099-8.
2. Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ: Algorithms for loop matchings. SIAM Journal on Applied Mathematics. 1978, 35 (1): 68-82. 10.1137/0135006.
3. Zuker M, Stiegler P: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research. 1981, 9 (1): 133-148. 10.1093/nar/9.1.133.
4. Mathuriya A, Bader DA, Heitsch CE, Harvey SC: GTfold: a scalable multicore code for RNA secondary structure prediction. Proceedings of the 2009 ACM Symposium on Applied Computing (SAC '09), pp 981-988. 2009, ACM, New York, NY, USA.
5. Swenson MS, Anderson J, Ash A, Gaurav P, Sukosd Z, Bader DA, Harvey SC, Heitsch CE: GTfold: enabling parallel RNA secondary structure prediction on multicore desktops. BMC Res Notes. 2012, 5 (1): 341. 10.1186/1756-0500-5-341.
6. Tan G, Sun N, Gao GR: A parallel dynamic programming algorithm on a multi-core architecture. Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '07), pp 135-144. 2007, ACM, New York, NY, USA.
7. Estrada T, Licon A, Taufer M: CompPknots: a framework for parallel prediction and comparison of RNA secondary structures with pseudoknots. Proceedings of the 2006 International Conference on Frontiers of High Performance Computing and Networking (ISPA '06), pp 677-686. 2006, Springer, Berlin, Heidelberg.
8. Xia F, Dou Y, Zhou X, Yang X, Xu J, Zhang Y: Fine-grained parallel RNAalifold algorithm for RNA secondary structure prediction on FPGA. BMC Bioinformatics. 2009, 10: 1-14. 10.1186/1471-2105-10-1.
9. Jacob A, Buhler J, Chamberlain RD: Accelerating Nussinov RNA secondary structure prediction with systolic arrays on FPGAs. Application-Specific Systems, Architectures and Processors (ASAP 2008), International Conference on, pp 191-196. 2008.
10. Dou Y, Xia F, Jiang J: Fine-grained parallel application specific computing for RNA secondary structure prediction using SCFGs on FPGA. Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '09), pp 107-116. 2009, ACM, New York, NY, USA.
11. Lavenier D, Rizk G, Rajopadhye S: GPU accelerated RNA folding algorithm. GPU Computing Gems. 2011.
12. Chang DJ, Kimmer C, Ouyang M: Accelerating the Nussinov RNA folding algorithm with CUDA/GPU. Signal Processing and Information Technology (ISSPIT), 2010 IEEE International Symposium on, pp 120-125. 2010.
13. Volkov V, Demmel JW: Benchmarking GPUs to tune dense linear algebra. Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC '08), pp 31:1-31:11. 2008, IEEE Press, Piscataway, NJ, USA.
14. NVIDIA: NVIDIA's Next Generation CUDA Compute Architecture: Fermi. NVIDIA. 2009, version 1.1. [http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf]
15. Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis. 2006, pp 92-96, Chap. 4.4, 11.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.