 Research article
 Open Access
 Published:
Tiling Nussinov’s RNA folding loop nest with a spacetime approach
BMC Bioinformatics volume 20, Article number: 208 (2019)
Abstract
Background
An RNA primary structure, or sequence, is a single strand considered as a chain of nucleotides from the alphabet AUGC (adenine, uracil, guanine, cytosine). The strand can be folded onto itself, i.e., one segment of an RNA sequence might be paired with another segment of the same RNA sequence into a twodimensional structure composed by a list of complementary base pairs, which are close together with the minimum energy. That list is called RNA’s secondary structure and is predicted by an RNA folding algorithm. RNA secondary structure prediction is a computingintensive task that lies at the core of search applications in bioinformatics.
Results
We suggest a spacetime tiling approach and apply it to generate parallel cache effective tiled code for RNA folding using Nussinov’s algorithm.
Conclusions
Parallel tiled code generated with a suggested spacetime loop tiling approach outperforms known related codes generated automatically by means of optimizing compilers and codes produced manually. The presented approach enables us to tile all the three loops of Nussinov’s recurrence that is not possible with commonly known tiling techniques. Generated parallel tiled code is scalable regarding to the number of parallel threads – increasing the number of threads reduces code execution time. Defining speed up as the ratio of the time taken to run the original serial program on one thread to the time taken to run the tiled program on P threads, we achieve superlinear speed up (a value of speed up is greater than the number of threads used) for parallel tiled code against the original serial code up to 32 threads and superlinear speed up scalability (increasing speed up with increasing the thread number) up to 8 threads. For one thread used, speed up is about 4.2 achieved on an Intel Xeon machine used for carrying out experiments.
Background
Ribonucleic acid (RNA) molecule is one of the most important molecules in the biological systems. RNA is typically produced as a single stranded molecule, which then folds intramolecularly to form a number of short basepaired stems. This basepaired structure is called the secondary structure of the RNA. The dynamic programming approach to RNA secondary structure prediction relies on the fact that structures can be recursively decomposed into smaller components. In each of the decomposition steps, only a single loop (or stacking of two consecutive base pairs) needs to be evaluated.
Nussinov proposed a dynamic programming algorithm for RNA folding in 1978 [1], which maximizes the number of noncrossing matchings between complimentary bases of an RNA sequence of length N.
Let X=x_{1},x_{2},…,x_{N} be an RNA sequence, where x_{i}∈{G(guanine),A(adenine),U(uracil),C(cytosine)} is a nucleotide. Nussinov’s dynamic programming recurrence for N×N matrix S is given below.
Here, S(i,j) defines the maximum number of basepair matches of x_{i},…,x_{j} over the region 1≤ i<j≤N and σ(i,j) is a function, which returns 1 if (x_{i},x_{j}) is an AU, GC, or GU pair and 0 otherwise.
Listing 1 represents the triply nested affine loops with two statements accessing the twodimensional array S implementing Nussinov’s algorithm.
Fast Nussinov implementations for shared memory architectures must ensure both aspects of code parallelism and cache optimization. Cache optimization is not found or limited to the memory layout optimization to improve spatial locality in popular parallel implementations of RNA folding, for example, GTFold [2], UNAfold [3] or RNAfold [4], which, however, implement energy minimization.
There are several manual or empirical approaches in literature improving data locality of serial or multithreaded RNA folding code, e.g. [5–10], dedicated to various hardware platforms including GPUs and FPGAs.
Li et al. [5] suggested a cache efficient version of Nussinov’s recurrence by using the lower triangle of matrix S to store the transpose of the computed values in the upper triangle of S. As new S_{i,j}s are computed, they are stored in both S_{i,j} and S_{j,i} for j≤i. The sum S_{i,k}+S_{k+1,j} is computed as S_{i,k}+S_{j,k+1}. Hence, Li’s modifications accelerate rapidly code execution because reading values in a row is more cache efficient than reading values in a column [5].
Zhao and Sahni developed three cacheefficient algorithms without increasing the memory requirement, ByRow, ByRowSegment, and ByBox for Nussinov’s RNA folding [6]. They showed that presented techniques based on a simple LRU cache model give better run time and energy performance than Li’s approach. Unfortunately, no parallel code is presented by the authors.
The effectiveness of automatic tiling and parallelization of loop nests depends a lot of their dependence patterns. The dependences of a loop nest can be classified into two categories: uniform and nonuniform. The dependences are uniform only when the distances between dependent loop nest statement instances in the iteration space are uniform, i.e., these distances are expressed by constants; otherwise they are nonuniform. A set of distance vectors represents distances between dependent loop nest statement instances calculated as the difference between the iteration vectors representing the destinations and sources of dependences.
For uniform dependences, the corresponding dependence graph is regular while for nonuniform dependences it is irregular. Automatic tiling and parallelization of loop nests with nonuniform dependences by means of affine transformations is much difficult than those exposing uniform dependences. The reason is that for such dependences, in general, constraints formed to extract affine transformations (to be next applied to tile and parallelize loops) are parametric and nonlinear, this considerably increases the computational complexity of extracting affine transformations. Even when affine transformations can be found, they do not guarantee efficient loop tiling and parallelization: only some loops from all ones in the loop nest can be tiled and/or parallelized.
The Nussinov kernel involves mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model [11]. However, the Nussinov RNA folding acceleration is still a challenging task for modern compilers because that code is within nonserial polyadic dynamic programming (NPDP), which is a particular family of dynamic programming with nonuniform data dependences, and it, as mentioned above, is more difficult to be optimized [7].
Optimizing compilers usually apply loop tiling to generate cacheefficient code on multicore architectures that maximizes data reuse in deep memory hierarchies and reduces synchronization cost [12]. Loop tiling transformations allow for improving data locality and generate coarsegrained parallel code that leads to improving code performance [13]. The most popular techniques of tiling are based on the affinetransformation framework (ATF), which is implemented in several tools [12].
Pluto [14] is the most popular stateoftheart sourcetosource polyhedral code generator that transforms C programs to parallel coarsegrained code with enhanced data locality. Pluto uses a scheduling algorithm, which tries to find affine transformations allowing the most efficient tiling. The main purpose is to minimize the amount of intertiles communications and ensure parallelism among tiles. The Pluto schedule is optimal, it reduces the number of dependences crossing tile boundaries. Unfortunately, Pluto fails to generate a band of loops where all multiple consecutive loops may be interchanged while respecting the legality in the case of rectangular tiling for NPDP kernels. Pluto serializes the innermost loop of Nussinov’s RNA folding, which is a key of cache locality optimization [15]. As a consequence, Pluto fails to generate 3D tiles that prevents achieving maximal code locality and performance.
Mullapudi and Bondhugula introduced a dynamic tiling technique for Zuker’s RNA secondary structure prediction [11]. Their technique overcomes some limitations of the affine transformation framework. Generated tiles are of the 3D dimension and they can be scheduled only at runtime, i.e., that technique does not allow for generation of any static code.
Wonnacott et al. suggested 3D tiling of “mostlytileable” loop nests representing Nussinov’s RNA secondary structure prediction [16]. This approach extracts nonproblematic statement instances in the loop nest iteration space, i.e., those that can be safely tiled by means of wellknown techniques. The reminding statement instances should be run serially to preserve all the dependences available in the loop nest. Unfortunately, the approach allows for generation of only serial code.
The tiling technique presented in paper [15] transforms (corrects) original rectangular tiles into target ones, which are valid under lexicographic order. Tile correction is performed by means of the transitive closure of loop dependence graphs. Loop skewing is used to parallelize code. We achieved higher speedup of generated tiled code in comparison with that produced with stateof theart sourcetosource optimizing compilers. However, the correction technique generates irregular tiles, some of them can be too large, this does not allow us to achieve maximal code locality and performance [17].
In this paper, we present a novel approach of spacetime tiling for accelerating Nussinov’s RNA folding, which allows for generation of tiled code with the following features:

the dimension of generated tiles is 3D,

generated code can be easily parallelized by means of the skewing technique,

target parallel code is scalable regarding to the number of threads and the length of an RNA sequence,

generated parallel code is regular and compact, it outperforms known automatically and manually generated related codes.
The concept of spacetime tiling is the following. Scrutinizing the Nussinov loop nest, we discover that dependences along both axis i and j spread in only the forward direction, i.e., the two corresponding elements of all dependence distance vectors are nonnegative (taking into account that the value of index i is decremented). Using this fact, we split the iteration spaces of the loop nest statements into two groups of subspaces of a fixed width, which intersect axes i and j and are in parallel with planes (j,k) and (i,k), respectively. Figure 1 presents such subspaces for statement S1. Blue lines depict planes situated in parallel with plane (j, k) while red ones present planes located in parallel with plane (i, k). Each subspace of the first and second groups is represented with identifiers id_{1} and id_{2}, respectively; integers in brackets are the identifiers of subspaces; name SPACEi(j) states for the subspace belonging to group i,i=1,2 and its identifier is j=0,1.
Then we form tiles as the intersection of sets representing the subspaces mentioned above, see Fig. 1. Each such a tile is represented with an identifier (id_{1},id_{2}), where id_{1} and id_{2} are the identifiers of the corresponding subspaces, in Fig. 1, they are shown in brackets. Such tiles are valid under lexicographical order because intertile dependences are spread in only the forward directions regarding to both id_{1} and id_{2}, i.e, the both corresponding elements of all dependence distance vectors are nonnegative.
The size of generated tiles is limited along axes i and j with the width of subspaces, but it is not limited along axis k that reduces code locality – see Fig. 1. To split each tile generated as presented above into subtiles along axis k, we can find any valid schedule of statement instances allowing for forming time partitions, which are to be enumerated serially, while statement instances of each time partition can be run in parallel. Let us consider Fig. 2. It represents the iteration space of a single tile provided that the subspace width is equal to 3. Suppose that dependences within that tile are characterized with the following set of distance vectors.
{(0,0,1),(0,1,0),(1,0,−1)}.
Distance vectors allow us to generate all dependences in the iteration space shown in Fig. 2. To find the destination of a dependence with a given dependence source within the iteration space, we add a distance vector to this source and if the resulting iteration is within the iteration space, we conclude that there is the dependence in that iteration space.
In Fig. 2, we show with red arrows some dependences corresponding to the distance vectors above. Valid time partitions, ti, can be represented with the following sets, each including iterations, which can be run in parallel.
t0:={ (0,0,0) },
t1:={ (0,0,1),(0,1,0)},
t2:={ (0,0,2),(0,1,1),(0,2,0),(1,0,0) },
t3:={ (0,1,2),(0,2,1),(1,0,1),(1,1,0) },
t4:={ (0,2,2),(1,0,2),(1,1,1),(1,2,0),(2,0,0) },
t5:={ (1,1,2),(1,2,1),(2,0,1),(2,1,0) },
t6:={ (1,2,2),(2,0,2),(2,1,1),(2,2,0) },
t7:={ (2,1,2),(2,2,1) },
t8:={ (2,2,2) }.
Each time partition comprises independent iterations, which can be executed in parallel while time partitions should be enumerated in lexicographical order. In Fig. 2, labels ti,i=0,1,2,…,8 mark a time partition comprising the corresponding iteration.
Next we combine time partitions into time slices each including a fixed number of time partitions. Provided that the number of time partitions in a slice is equal to 3, we get the following time slices, TIMEi, i=1,2,3.
TIME1:=t0 ∪ t1 ∪ t2={ (0,1,2),(0,2,1),(1,0,1),(0,0,1),(1,1,0),(0,1,0),(0,0,0) },
TIME2:=t3 ∪ t4 ∪ t5={ (0,2,2),(1,1,2),(0,1,2),(1,0,2),(1,2,1),(0,2,1),(1,1,1),(2,0,1),(1,0,1),(1,2,0),(2,1,0),(1,1,0),(2,0,0) },
TIME3:=t6 ∪ t7 ∪ t8={ (2,2,2),(1,2,2),(2,1,2),(2,0,2),(2,2,1),(2,1,1),(2,2,0) }.
Enumerating time slices is valid under lexicographic order because before executing the next slice all data necessary are already calculated in the previous slices. So, within a single tile, we form subtiles represented with time slices whose execution in serial order increases code locality because instead of execution of one larger tile, multiple smaller subtiles will be executed.
To generate target tiles, we apply the intersection operation to sets representing subspaces and time slices. In the following section, we prove that for the Nussinov loop nest, resulting tiles are valid under lexicographical order and demonstrate how parallel target code can be generated.
Methods
In compiler research, polytopes and related mathematical objects have been successfully applied to represent and manipulate an important class of compute and dataintensive programs in an approach that has become known as the polyhedral model. It formalizes analyzing, parallelizing, and transforming program fragments consisting of (sequences of) arbitrarily nested loops (like dynamic programming loops), where the loop bounds, statements conditions and array accesses are affine combinations of symbolic constants and loop iterators.
The polyhedral method treats each iteration of a loop statement within the loop nest as an integral point inside mathematical objects called polyhedra that contains all iterations of the statement. A convex polyhedron can be formally defined as the set of solutions to a system of linear inequalities of the form Mx≤b.
The polyhedral model of a loop nest includes i) a set representing an iteration space for each statement, ii) access relations (read and write) for each array available in the loop nest body, iii) relations describing a global schedule for each statement – a discrete time when a statement instance is executed according to the original iteration execution order.
A mathematical representation of a set is the following.
S:=PARAMS→{NAME(I)  constraints}, where S is the set name, “ PARAMS→” means that the constraints include parameters PARAMS, each parameter is an arbitrary integer whose value defines an upper loop bound, NAME(I) is the named tuple with name NAME and vector I whose elements are loop indices or expressions including loop indices; constraints are comprised of affine equations and inequalities including vector I and parameters PARAMS combined through the conjunction (∧), disjunction (∨), projection (∃), and negation (¬) operators. For example, S:=N→{S1(i,j) 1<=i<=N ∧ 1<=i<=N} denotes a parametric set (regarding to parameter N) with name S, named tuple S1(i,j), and constraints 1<=i<=N ∧ 1<=i<=N.
Relations are defined in a similar way as sets, except that a single tuple is replaced with a pair of tuples separated by the arrow sign “ →”, i.e, the mathematical representation of a relation is the following.
R:=PARAMS→{NAME1(I)→NAME2(J)constraints}, where R is the relation name, NAME1(I) and NAME2(J) are the named tuples with names NAME1 and NAME2 including vectors I and J, respectively. For example, R:=N→{S1(i1,j1)→S2(i1,j2) 1<=i1,i2<=N ∧ 1<=i1,j2<=N} denotes the parametric relation with name R, tuples S1(i1,j2),S2(i2,j2) and the constraints 1<=i<=N1<=i1,i2<=N ∧ 1<=i1,j2<=N. A relation maps elements represented with the first tuple to elements of the second one.
To extract the polyhedral model for a C program (all its three components), the PET (Polyhedral Extraction Tool) tool can be applied [18]. To extract dependences available in a loop nest, we use the polyhedral model returned with PET and apply the iscc calculator [19] to implement calculations on polyhedral sets and relations in a way presented in paper [20]. Iscc is an interactive interface to the barvinok counting library [21] and PET.
To optimize code, we use the following operations on relations and sets: intersection (∩), union (∪), difference (), domain (dom R), range (ran R), relation application (S ^{′}=R(S): e ^{′}∈S ^{′} iff exists e s.t. e →e ^{′}∈R,e ∈S) [21].
Given the loop nest, the iteration i is lexicographically less than iteration j, denoted as i≺j, if the following conditions are true i_{1}<j_{1}∨∃k≥1:i_{k}<j_{k}∧i_{t}=j_{t},for t <k.
The iteration domain of the Nussinov loop nest in Listing 1) is represented with the following parametric set comprising all the statement instances executed for statements S1 and S2.
Iteration Domain:=N→{S1(i,j,k)  0≤i≤N−−1∧ i+1≤j≤N−−1∧ 0≤k≤j−i−1;S2(i,j)  0≤i≤N−−1∧ i+1≤j≤N−−1},
where S1(i,j,k),S2(i,j) are the tuples defining the iteration domain of the first and second statements of the Nussinov loop nest, respectively; the constraints of the tuples specify the value range of indices i, j, k in the loop nest.
The relation representing dependences available in the examined loop nest is the following.
R:=N→{ S1(i, j, k)→S2(i^{′},i+j−i^{′})∣0≤k<−i+j∧i^{′}≥−1+i∧i^{′}≥0∧−N+i+j<i^{′}≤i }∪N→{ S1(i,j,k)→S1(i,j,k^{′})∣i≥0∧j<N∧k≥0∧k<k^{′}<−i+j }∪N→{ S1(i,j,k)→S1(i,j^{′},−i+j)∣i≥0∧0≤k<−i+j∧j<j^{′}<N }∪N→{ S1(i,j,k)→S1(i^{′},j,−1+i−i^{′})∣j<N∧0≤k<−i+j∧0≤i^{′}<i }∪N→{ S2(i,j)→S1(i,j^{′},−i+j)∣i≥0∧j>i∧j<j^{′}<N }∪N→{ S2(i,j)→S1(i^{′},j,−1+i−i^{′})∣i<j<N∧0≤i^{′}<i }∪N→{ S2(i,j)→S2(−1+i,1+j)∣i>0∧i<j≤−2+N },
where “ N→” means that N is a parameter, i.e., an arbitrary constant whose value defines loop upper bounds in the relation constraints. Relation R is represented with a union (∪) of simpler relations included in curly braces. Each simple relation includes two named tuples, for example, S1(…) →S2(…) and constraints. Elements of tuples are loop iterators or affine expressions whose components are loop iterators. The left tuple defines dependence sources while the right one states for the corresponding dependence destinations. Name S1 of the left tuple means that dependence sources are originated with statement S1 while S2 marks that dependence destinations are descended with statement S2. Each constraint includes affine inequalities whose elements are loop iterators and parameter N, for example, 0≤k<−i+j, multiple inequalities are combined by means of the conjunction operator (∧).
Using relation R, we can discover the maximal number of rectangular subspace types to be formed, this number is defined with the maximal number of the outer loops for which all elements of all distance vectors (formed as the differences between the range and domain of relation R) are nonnegative. Such subspaces can be enumerated in lexicographical order because there is no cycle among them (the two corresponding elements of all distance vectors are nonnegative).
Because statements S1 and S2 have different iteration spaces, we should form a global iteration space common for both statements S1 and S2. With this purpose, we apply the global schedule of the statements of the examined loop nest. It is a part of the loop nest polyhedral model and represented with a relation, which maps an iteration vector of a statement to a corresponding multidimensional timestamp, i.e., a discrete time when the statement instance has to be executed in the common iteration space. PET returns the following global schedule for the examined loop nest.
SCHED_GLOB:=N→{ S2(i, j)→(i,j,1,0) }∪N→{ S1(i,j,k)→(i,j,0,k) }.
Next, we form relation, R_GLOB, by means of replacing each named tuple of relation R with the tuple resulting due to applying relation SCHED_GLOB to named tuples S_{1} and S_{2}. That relation describes dependences in the common iteration space and it is as follows.
R_GLOB:=N→{ (i,j,0,k)→(i,j,k^{′},1,0)∣i≥0∧j<N∧k≥0∧k<k^{′}<−i+j }∪N→{ (i,j,0,k)→(i^{′},i+j−i^{′},1,0)∣0≤k<−i+j∧i^{′}≥−1+i∧i^{′}≥0∧−N+i+j<i^{′}≤i }∪N→{ (i,j,0,k)→(i,j^{′},0,−i+j)∣i≥0∧0≤k<−i+j∧j<j^{′}<N }∪N→{ (i,j,0,k)→(i^{′},j,0,−1+i−i^{′})∣j<N∧0≤k<−i+j∧0≤i^{′}<i }∪N→{ (i,j,1,0)→(i,j^{′},0,−i+j)∣i≥0∧j>i∧j<j^{′}<N }∪N→{ (i,j,1,0)→(i^{′},j,0,−1+i−i^{′})∣i<j<N∧0≤i^{′}<i }∪N→{ (i,j,1,0)→(−1+i,1+j,1,0)∣i>0∧i<j≤−2+N }.
Then we apply the deltas operator of the iscc calculator to relation R_GLOB and get the following distance vectors in the global (common) iteration space.
N→{ (i,−i,1,k)∣−1≤i≤0∧2−N−2i≤k≤0 }∪N→{ (i,0,0,k)∣i<0∧−N−2i<k<−i }∪N→{ (0,j,0,k)∣j>0∧0<k<N−j }∪N→{ (0,j,−1,i_{3})∣j>0∧0<i_{3}<N−j }∪N→{ (i,0,−1,−1−i)∣2−N≤i<0 }∪N→{ (−1,1,0,0)∣N≥4 }.
Taking into account that the value of iterator i is decremented, we conclude that only the third element of each distance vector is negative. This allows us to state that subspaces located in parallel with plane (j,k) and intersecting axis i as well as subspaces located in parallel with plane (i,k) and intersecting axis j, can be enumerated in lexicographical order because the both elements of all the corresponding dependence distance vectors are nonnegative.
We split the iteration spaces of statements S1 and S2 into subspaces of width 16 (any other constant width can be chosen). For statement S1, sets SPACE_{11} and SPACE_{12} below represent subspaces intersecting axes i and j, respectively.
SPACE_{11}:=(ii,N)→{ S1(i,j,k)∣ii≥0∧N>0∧i≥0∧−16−16ii+N≤i<−16ii+N∧i≤−2+N∧i<j<N∧0≤k<−i+j },
SPACE_{12}:=(jj,N)→{ S1(i,j,k)∣jj≥0∧N>0∧0≤i≤−2+N∧j>16jj+i∧i<j<N∧j≤16+16jj+i∧0≤k<−i+j }.
For statement S2, sets SPACE_{21} and SPACE_{22} below represent subspaces intersecting axes i and j, respectively.
SPACE_{21}:=(ii,N)→{ S2(i,j)∣ii≥0∧N>0∧i≥0∧−16−16ii+N≤i<−16ii+N∧i≤−2+N∧i<j<N },
SPACE_{22}:=(jj,N)→{ S2(i,j)∣jj≥0∧N>0∧0≤i≤−2+N∧j>16jj+i∧i<j<N∧j≤16+16jj+i }.
Variables ii and jj are the parametric identifiers of subspaces. The intersection of the union of sets SPACE_{11}, SPACE_{12} and the union of sets SPACE_{21}, SPACE_{22} results in tiles whose size is limited along axes i and j with the width of subspaces (16), but the size of those tiles is not limited along axis k. Fig. 1 presents such subspaces and tiles for statement S1, integers in brackets are the identifiers of subspaces and tiles. Blue lines depict planes situated in parallel with plane (j, k) while red ones present planes located in parallel with plane (i, k).
For large N, the entire data associated with each such a tile cannot be held at cache, this leads to decreasing code locality. To improve code locality, we form time slices each including a constant number of time partitions. Each time partition holds statement instances that can be run in parallel for a given schedule. For a time slice, the number of statement instances within axes k is limited with a constant number of time partitions within that slice. Let us suppose that a time slice is represented with a parametric set, TIME, while a set representing the results of the intersection of above mentioned sets is named as SPACE. Then the intersection of sets TIME and SPACE results in a parametric set describing tiles whose size is limited along all axes: i, j, and k. Choosing the proper width of subspaces and the proper number of time partitions within a time slice, we may obtain tiles for which the entire data associated with each of them can be held at cache, this can improve significantly code locality.
To form time partitions, we apply the loop skewing transformation [22]. It is a convenient method to implement the wavefront method of executing a loop nest in parallel, which creates a “wave” that passes through the iteration space. Skewing changes the iteration vectors for each iteration by adding the outer loop index value to the inner one, for example, for a loop nest of depth 2 with iterators i and j, iteration (i,j) becomes relabeled as (i,i+j). If all distance vectors of a new loop nest with iterators i and i+j comprise only nonnegative elements, interchanging those iterators, i.e., forming new iterators i+j and i allows for generation of a new loop nest, where the outermost loop i+j is serial while i is parallel. In general, for a loop nest of depth d, a new loop nest with iterators (i_{1}+i_{2}+…+i_{d},i_{1},i_{2},…,i_{d−1}) is valid if all distance vectors comprise only nonnegative elements; the first loop is serial while the reminding ones are parallel [22].
For the Nussinov loop nest, we first form the following schedule.
SCHED:=N→{ S1(i, j, k)→(−i+j,k)∣N>0∧0≤i≤−2+N∧i<j<N∧0≤k<−i+j }∪N→{ S2(i,j)→(−i+j,j)∣N>0∧0≤i≤−2+N∧i<j<N }.
That schedule maps each instance of statements S1 and S2 to twodimensional time (−i+j,k) and (−i+j,j), respectively. To check whether that schedule is valid, we apply the way suggested in paper [23], which envisages checking whether the following inequality δ((SCHED^{−1}). R. SCHED)≥ 0 is true, where R is the relation describing all the dependences available in the examined loop nest, it is presented above; “.” is the iscc join (composition) operator of two relations; δ is the deltas iscc operator that maps a relation to the differences between image and domain elements.
Using such a checking, we conclude that relation SCHED above is valid. By means of relation SCHED, we form set TIME representing time slices each including 16 time partitions (any other constant value can be chosen). With this purpose, we first calculate the inverse relation, SCHED^{−1}, of relation SCHED.
SCHED^{−1}:=N→{ (i0,i1)→S10(i,i0+i,i1)∣N>0∧i_{0}>0∧0≤i_{1}<i0∧0≤i<N−i0∧i≤−2+N }∪N→{ (i_{0},i_{1})→S12(−i0+i1,i1)∣N>0∧i_{0}>0∧i0≤i_{1}<N∧i_{1}≤−2+N+i0 }.
In the relation above, variables i0,i1 represent twodimensional time. To form set TIME, we i) make i0 to be the parameter of set TILE, ii) make the right tuple and the constraints of relation SCHED^{−1} to be the tuple and constraints of set TIME, iii) introduce parameter tt of set TIME and add the constraints of the form i1≥16tt∧0≤i1≤15+16tt to the constraints of set TIME, those constraints mean that the width of a time slice is 16 (any other constant value can be chosen) and time partitions within a time slice are dependent on parameter tt. This results in the following set.
TIME:=(i0,tt,N)→{ S1(i,i0+i,i1)∣i0>0∧N>0∧0≤i<−i0+N∧i≤−2+N∧i1≥16tt∧0≤i1≤15+16tt∧i1<i0 }∪(i0,tt,N)→{ S2(i,i0+i)∣i0>0∧N>0∧i≥0∧−i0+16tt≤i<−i0+N∧i≤15−i0+16tt∧i≤−2+N }.
We calculate set TILE, which represents target tiles, as follows
TILE:=TIME. (SPACE_{11}∪SPACE_{12}). (SPACE_{21}∪SPACE_{22})=(N, jj, ii, i0, tt)→{ S2(i,i0+i)∣jj≥0∧N>0∧ii≥0∧i0>16jj∧0<i0≤16+16jj∧i≥−16+N−16ii∧i≥0∧−i0+16tt≤i≤15−i0+16tt∧i<N−16ii∧i<N−i0∧i≤−2+N }∪(jj,N,ii,i0,tt)→{ S1(i,i0+i,k)∣jj≥0∧N>0∧ii≥0∧i0>16jj∧0<i0≤16+16jj∧i≥−16+N−16ii∧0≤i<N−16ii∧i<N−i0∧i≤−2+N∧k≥16tt∧0≤k≤15+16tt∧k<i0 }.
To find out what is the number of statement instances within a tile represented with the set above, we applied the iscc card operator to set TILE, which calculates the number of elements within a set. The analysis of the result returned with that operator allows us to conclude that the size of each tile is not parameterized, i.e., the number of elements within each tile does not depend on the parameter N defining the upper bounds in the Nussinov loop nest as it takes place when the tile correction technique [15] is applied to generated target tiles.
Let us rewrite set TILE in the following form.
TILE:=(N, jj, ii, i0, tt)→{ S1(i,i0+i,k)∣ constraints_{1}; S2(i,i0+i)∣ constraints_{2} }.
To generate parallel code on the tile level, we apply the skewing transformation (ii+jj) to form the following schedule allowing for parallel code generation.
SCHED_PAR:=N→{ S1(i,i0+i,k)→(ii+jj,jj,i0,tt, i, i0+i, k)∣ constraints_{1};S2(i,i0+i)→(ii+jj,jj,i0,tt, i, i0+i)∣ constraints_{2}},
where constraints_{1} and constraints_{2} are the constraints of set TILE above.
That schedule maps each instance of statements S1 and S2 to a time partition whose all tiles can be executed in parallel. To check the validity of that schedule we again apply the technique presented in paper [23] and explained above to relation SCHED_PAR, and confirm that the schedule is valid.
Applying the iscc codegen operator to relation SCHED_PAR, we generate pseudocode and postprocess it to the parallel tiled code presented in Listing 2. In that code, the outermost loop c0 enumerates serially time partitions including tiles, loop c1 scans tiles to be executed in parallel for each time partition, loops c3,c4, and c6 enumerate serially statement instances within each tile, pragma omp parallel for makes loop c1 parallel [24].
Results
We conducted experiments on RNA secondary structure prediction using a machine with 2 × Intel Xeon processors E52695 v2, 2.4 GHz, 12 cores/24 threads, 256KB L2 Cache and 30MB L3 Cache, and 128 GB RAM. All programs were compiled with using of the Intel C++ Compiler (icc 17.0.1) with the O3 flag of optimization. The code parallelism is presented with the OpenMP programming interface [24].
At the address traco.sourceforge.net/nuss.tar.gz , all source codes used for carrying out experiments can be found as well as a program allowing us to run each parallel program for a prepared sequence in the FASTA format and obtain a target Nussinov table.
To carry out experiments, we used randomly generated RNA strands of length from 2500 to 15000. Papers [5, 6, 15] show that cache efficient code performance does not change based on strings themselves, but it depends on the size of a string.
We compared the performance of code generated with the presented approach with that of i) Pluto parallel tiled code (based on affine transformations), ii) tiled code based on the correction technique [15, 17], and iii) the Li manual cache efficient implementation [5] of Nussinov’s RNA folding. The tile size 16 ×16×1 for Pluto code [14] was chosen empirically (Pluto does not tile the most inner loop) as the best among many sizes examined. For the code generated with the presented approach, the tile size 16 ×16×16 was chosen from many different tile sizes, examined by us, as one exposing the highest code performance. We used this tile size also for tiled code generated with the tile correction technique based on the transitive closure of dependence graphs. We experimented with tiled code based on correction also for the best tile size demonstrated in paper [17], 1 ×128×16.
The results in Table 1, including the execution time of the examined programs and Fig. 3 graphically presenting the corresponding code speed up against the original code, demonstrate that only the code generated with the suggested approach is scalable up to 48 threads, i.e., increasing the number of threads reduces the time of code execution. The results in Table 1 show that spacetime tiled code implementing Nussinov’s algorithm with the tile size [16 ×16×16] outperforms all examined implementations for a larger number of threads (equal or greater than 16) for N=5000.
Table 2 and Fig. 4 present execution time and speed up for various RNA sequence lengths, respectively. We can see that the presented approach allows for obtaining cache efficient tiled code, which outperforms the other examined implementations for each length.
Discussion
In paper [17], we showed that tile correction allows for generation code of the best performance when the outermost loop is serial. We revealed that the best tile size (for tile correction) is not optimal for the code generated with the suggested approach in that case, tiling all loops is a better solution.
Spacetime tiled code achieves superlinear speed up (greater than the number of threads used) of serial and parallel tiled code against the original serial code up to 32 threads. However super linear speed up scalability (increasing speedup with increasing the thread number) is observed only from one to eight threads. Code regularity and lack of parametric large tiles allow us to use the entire power of the multiprocessor machine with the maximal number of threads while the speed up of Li’s code and that based on tile correction is significantly limited in this case.
The transposition of the Nussinov array [5] is effective only for short RNA strands with the length equal to 2500. For longer sequences (greater than 7500), only the performance of code based on applying transitive closure with the best tile size is comparable, but it is still worse than the performance of code generated with the approach presented in this paper.
Summing up, we may conclude that the parallel tiled code generated by means of the presented spacetime approach is the fastest implementation among all examined ones including the manual generated Li code. Code regularity and fixed tiles are dominant factors in achieving high code performance and scalability in comparison to our previous implementation based on tile correction. We achieved the best code performance for each RNA sequence length using all cores/threads on the studied Intel Xeon machine.
Conclusion
In this paper, we introduced a spacetime tiling approach for the loop nest implementing Nussinov’s folding. It allows us to generate parallel tiled code, which outperforms known related codes generated automatically by means of affine transformations, tile correction based on the transitive closure of dependence graphs, and manually generated Li’s transposition code. The presented approach enables us to tile all three loops of Nussinov’s recurrence that is not possible with commonly known optimizing compilers based on affine transformations. The approach generates 3D tiles for the Nussinov loop nest.
The results of an experimental study allow us to conclude that the generated code based on the spacetime approach i) outperforms known related codes, ii) is scalable regarding to the number of parallel threads (execution time decreases with increasing the thread number up to 48 threads); iii) allows for achieving superlinear speed up (greater than the number of threads used) of serial and parallel tiled code against the original serial code up to 32 threads and superlinear speed up scalability (increasing speed up with increasing the thread number) up to 8 threads, for one thread, speed up is about 4.2.
The presented code optimization can be applied to other dynamic programming kernels, for example, DNA sequence alignment or energy minimization for RNA folding. In the future, we plan to apply spacetime tiling to Zuker’s algorithm allowing for energy minimization. It is much complex than the algorithm examined in this paper – there are four nested loops and multiple complicated statements within the corresponding loop nest. However that code is still within the polyhedral model, so spacetime tiling can be applied to that code to tile all loops.
Abbreviations
 ATF:

Affine transformation framework
 FPGA:

fieldprogrammable gate array
 NPDP:

Nonserial polyadic dynamic programming
References
 1
Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ. Algorithms for Loop Matchings. SIAM J Appl Math. 1978; 35(1):68–82.
 2
Mathuriya A, Bader DA, Heitsch CE, Harvey SC. Gtfold: A scalable multicore code for rna secondary structure prediction. In: Proceedings of the 2009 ACM Symposium on Applied Computing. SAC ’09. New York: ACM: 2009. p. 981–8.
 3
Markham NR, Zuker M. In: Keith JM, (ed).UNAFold. Totowa: Humana Press; 2008. pp. 3–31.
 4
Hofacker IF, Stadler PF. Memory efficient folding algorithms for circular RNA secondary structures. Bioinformatics. 2006; 22(10):1172–6. https://academic.oup.com/bioinformatics/article/22/10/1172/236586.
 5
Li J, Ranka S, Sahni S. Multicore and GPU algorithms for Nussinov RNA folding. BMC Bioinformatics. 2014; 15(8):1. https://doi.org/10.1186/1471210515S8S1.
 6
Zhao C, Sahni S. Cache and energy efficient algorithms for nussinov’s rna folding. BMC Bioinformatics. 2017; 18(15):518.
 7
Liu L, Wang M, Jiang J, Li R, Yang G. Efficient nonserial polyadic dynamic programming on the cell processor. In: IPDPS Workshops. Anchorage: IEEE: 2011. p. 460–71.
 8
Almeida F, et al.Optimal tiling for the rna base pairing problem. In: Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures. SPAA ’02. New York: ACM: 2002. p. 173–82. https://doi.org/10.1145/564870.564901.
 9
Tan G, Feng S, Sun N. Locality and parallelism optimization for dynamic programming algorithm in bioinformatics. In: SC 2006 Conference, Proceedings of the ACM/IEEE. Tampa: IEEE: 2006. p. 41.
 10
Jacob A, Buhler J, Chamberlain RD. Accelerating Nussinov RNA secondary structure prediction with systolic arrays on FPGAs. In: Proceedings of the 2008 International Conference on ApplicationSpecific Systems, Architectures and Processors. ASAP ’08. Washington: IEEE Computer Society: 2008. p. 191–6. https://doi.org/10.1109/ASAP.2008.4580177.
 11
Mullapudi RT, Bondhugula U. Tiling for dynamic scheduling In: Rajopadhye S, Verdoolaege S, editors. Proceedings of the 4th International Workshop on Polyhedral Compilation Techniques. Vienna: 2014. http://impact.gforge.inria.fr/impact2014/papers/impact2014mullapudi.pdf. Accessed 15 Apr 2019.
 12
Hammami E, Slama Y. An overview on loop tiling techniques for code generation. In: 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA): 2017. p. 280–7. https://doi.org/10.1109/AICCSA.2017.168.
 13
Xue J. Loop Tiling for Parallelism. Norwell: Kluwer Academic Publishers; 2000.
 14
Bondhugula U, Hartono A, Ramanujam J, Sadayappan P. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not. 2008; 43(6):101–13.
 15
Palkowski M, Bielecki W. Parallel tiled Nussinov rna folding loop nest generated using both dependence graph transitive closure and loop skewing. BMC Bioinformatics. 2017; 18(1):290. https://doi.org/10.1186/s1285901717078.
 16
Wonnacott D, Jin T, Lake A. Automatic tiling of ‘mostlytileable” loop nests. In: IMPACT 2015: 5th, At Amsterdam, The Netherlands: 2015. http://impact.gforge.inria.fr/impact2015/papers/impact2015wonnacott.pdf. Accessed 15 Apr 2019.
 17
Palkowski M, Bielecki W. Tuning iteration space slicing based tiled multicore code implementing nussinov’s rna folding. BMC Bioinformatics. 2018; 19(1):12.
 18
Verdoolaege S, Grosser T. Polyhedral extraction tool. In: In Proceedings of the 2nd International Workshop on Polyhedral Compilation Techniques. Paris: 2012. http://impact.gforge.inria.fr/impact2012/workshop_IMPACT/verdoolaege.pdf. Accessed 15 Apr 2019.
 19
Verdoolaege S. Counting affine calculator and applications. In: First International Workshop on Polyhedral Compilation Techniques (IMPACT’11), Charmonix, France: 2011. http://perso.enslyon.fr/christophe.alias/impact2011/impact05.pdf. Accessed 15 Apr 2019.
 20
Verdoolaege S. Integer set library  manual, http://isl.gforge.inria.fr/manual.pdf. Technical report. 2011.
 21
Verdoolaege S. barvinok: User guide 0.41: 2018. http://barvinok.gforge.inria.fr/barvinok.pdf. Accessed 15 Apr 2019.
 22
Wolfe M. Loops skewing: The wavefront method revisited. Int J Parallel Prog. 1986; 15(4):279–93.
 23
Verdoolaege S, Carlos Juega J, Cohen A, Ignacio Gomez J, Tenllado C, Catthoor F. Polyhedral parallel code generation for cuda. ACM Trans Archit Code Optim (TACO). 2013; 9(4):54.
 24
OpenMP Architecture Review Board. OpenMP Application Program Interface Version 4.5. 2015. https://www.openmp.org/wpcontent/uploads/openmp4.5.pdf. Accessed 15 Apr 2019.
Acknowledgements
Not applicable.
Funding
No specific funding was received for this study.
Availability of data and materials
At the address traco.sourceforge.net/nuss.tar.gz , all source codes used for carrying out experiments can be found as well as a program allowing us to run each parallel program for a prepared sequence in the FASTA format and obtain a target Nussinov table.
Author information
Affiliations
Contributions
MP verified the main concept of the presented technique, analysed related work, implemented the approach and carried out the experimental study. WB proposed the main concept of the presented technique, checked the correctness of the presented technique, participated in its implementation and the analysis of the results of the experimental study. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Palkowski, M., Bielecki, W. Tiling Nussinov’s RNA folding loop nest with a spacetime approach. BMC Bioinformatics 20, 208 (2019). https://doi.org/10.1186/s1285901927856
Received:
Accepted:
Published:
Keywords
 RNA folding
 Loop tiling
 Spacetime tiling
 Nussinov’s algorithm
 Parallel computing