Prediction of RNA secondary structure with pseudoknots using integer programming

Background RNA secondary structure prediction is one major task in bioinformatics, and various computational methods have been proposed so far. Pseudoknot is one of the typical substructures appearing in several RNAs, and plays an important role in some biological processes. Prediction of RNA secondary structure with pseudoknots is still challenging since the problem is NP-hard when arbitrary pseudoknots are taken into consideration. Results We introduce a new method of predicting RNA secondary structure with pseudoknots based on integer programming. In our formulation, we aim at minimizing the value of the objective function that reflects free energy of a folding structure of an input RNA sequence. We focus on a practical class of pseudoknots by setting constraints appropriately. Experimental results for a set of real RNA sequences show that our proposed method outperforms several existing methods in sensitivity. Furthermore, for a set of sequences of small length, our approach achieved good performance in both sensitivity and specificity. Conclusion Our integer programming-based approach for RNA structure prediction is flexible and extensible.


Background
Functional noncoding RNAs (ncRNAs) have been recognized as regulatory or catalytic molecules, and have received much attention in recent years. It is known that ncRNA genes cover a fair proportion of the whole genome in higher organisms including humans [1], and studying functional ncRNAs is an important task to understand complex mechanism of higher organisms. Structure analysis of ncRNAs will help us elucidate their functions since it is widely recognized that there is correlation between structure and function. Due to difficulty in determining RNA 3D structure (tertiary structure) by experimental techniques, many attempts have so far been made at predicting secondary structure given an RNA sequence (primary structure). A secondary structure is defined as a set of hydrogen-bonding base pairs such as Watson-Crick complementary pairs (i.e., A-U and C-G). One of the fundamental secondary structures is shown in Figure 1(a), which is called a hairpin loop or stem loop. Another diagrammatic representation is arc depiction where base pairs are connected by arcs over the RNA sequence (see Figure 1(b)). Among several RNAs, including rRNAs, tmR-NAs and viral RNAs, there are substructures called pseudoknots (Figure 1(c)) where some arcs over the sequence cross in the arc representation as shown in Figure 1(d). Prediction of RNA secondary structure with pseudoknots has increased in importance since pseudoknots have been known to play an important role in a number of RNA functions such as ribosomal frameshifting and splicing. Furthermore, a biologically reliable database called Pseu-doBase [2] has been constructed, which contains structural, functional and sequence data on RNA pseudoknots.
Many single-stranded RNAs are considered to fold back on themselves to be thermodynamically stable. This idea has led to the development of several algorithms that minimize the equilibrium free energy of RNA. In the pioneering work, we refer to the Zuker's algorithm [3] that predicts secondary structure with the lowest free energy by using a dynamic programming (DP) technique, and its implementation named Mfold [4] is one of the major softwares for RNA secondary structure prediction. The time complexity of the Zuker's algorithm is O(n 3 ) where n is the length of an input RNA sequence. However, this algorithm cannot deal with pseudoknots. To predict pseudoknotted structure, several DP algorithms that minimize free energy in the range from O(n 4 ) to O(n 6 ) time were proposed including PKNOTS [5], pknotsRG [6] and iter-ated loop matching (ILM) [7]. These algorithms focus on some restricted pseudoknots for solvability in polynomial time since prediction of arbitrary pseudoknotted structure was proven to be NP-hard [8,9].
There have also been several grammatical approaches to modeling some kinds of pseudoknots, including use of tree adjoining grammar [10,11], crossed-interaction grammar [12], parallel communicating grammar [13] and multiple context-free grammar [14]. In these grammatical approaches, secondary structure prediction can be interpreted as parsing of the grammars, which can be addressed in O(n 4 ) to O(n 6 ) time. Although performance of these grammar-based methods is comparable to that of energy-based methods, grammatical approaches have the advantage of the capability of constructing profile (or modeling consensus structure) when multiple structural alignment is given in advance.
The sequence alignment information is also useful for RNA secondary structure prediction. Sankoff [15] proposed a DP algorithm that can simultaneously solve the sequence alignment and folding problems for multiple sequences. Later, Sankoff's algorithm was improved by Mathews and Turner [16], where their proposed algorithm is called Dynalign. Dynalign combines free energy minimization and comparative sequence analysis to find a low free energy structure common to two sequences Example of RNA secondary structure without requiring any sequence identity.
Recently, Parisien and Major [17] presented a new analysis of RNA secondary structure prediction by incorporating all base pairs including noncanonical ones. They introduced a new representation of the nucleotide relationships in structural RNAs, called the nucleotide cyclic motif (NCM). NCM is incorporated into the scoring function of a folding algorithm, which is also based on DP.
Returning to the viewpoint of energy minimization, RNA secondary structure prediction can be regarded as a kind of optimization problem. In fact, several problems in bioinformatics can be formulated as combinatorial optimization problems. One of the successful applications is RAPTOR [18], which calculates protein threading using an integer programming (IP) formulation. Although IP problems are known to be NP-hard, descriptive power of IP is strong and flexible. In addition, recent commercial optimization softwares can deal with relatively large-scale instances even if a problem is computationally hard to solve.
In this paper, we present a new method of predicting RNA secondary structure with pseudoknots based on IP. In our model, thermodynamic information is incorporated into the objective function whose value is to be minimized, and structural information is represented by constraints. It should be noted that we focus on a practical class of pseudoknots by setting constraints appropriately. We use the CPLEX software [19] to solve the IP problem and evaluate our method on a set of RNA sequences contained in Pseu-doBase [2] and Rfam [20]. Advantages of our proposed method are summarized as follows: • Our prediction method using IP is flexible. Specifically, various types of secondary structures can be handled by adding or removing constraints. In fact, we can describe pseudoknot-free structures as well as generalized planar pseudoknots (see [8] for definition) by IP formulations, and in this paper we provide a modeling of a certain type of recursive pseudoknot (see Methods).
• The IP-based method outperforms three existing methods PKNOTS, pknotsRG and ILM in sensitivity for 34 RNA sequences known to have pseudoknots. In particular, for a set of sequences of small length ( 46), our approach achieved good performance in both sensitivity and specificity.
• Once we can model the prediction problem by an IP formulation, we need not develop any algorithm from scratch because high-performance solvers for IP are now available. Thus, we can focus on how to describe the topology of structure and how to assign appropriate scores to the model.
• The IP-based approach is extensible. If we have incomplete data on secondary structure where parts of the structure (i.e., a set of base pairs) have been determined by experiment, we can complement the incomplete structure by modeling the known parts of the structure as constraints and solving the IP problem.

Definitions of RNA secondary structure and pseudoknot
In this section we will describe the preliminary definitions of RNA secondary structure including pseudoknots.
Definition 1 (RNA secondary structure). An RNA sequence is represented by a string of n characters s = s 1 s 2 ʜ s n where s i  {A, C, G, U}. A secondary structure of the sequence s is defined as a set S of base pairs (s i , s j ) such that the following conditions are satisfied: 1. 1  i <j  n, meaning, two bases that form a pair must be located at different positions. Here, we allow only Watson-Crick base pairs (A, U) and (C, G), and a wobble base pair (G, U) to form the secondary structure, which we call valid base pairs. Definition 2 (Pseudoknot). An RNA secondary structure S is said to contain a pseudoknot if and only if there exist Definition 3 (Pseudoknot free). An RNA secondary structure S is called pseudoknot free if and only if for all pairs (s i , s j ), (s i ', s j ')  S (i <i'), one of the following conditions is satisfied: There are various kinds of pseudoknots, depending on how arcs representing base pairs cross above the sequence. For tractability in computational complexity, several classes of pseudoknots were proposed. j 0 , (i 0 < <j 0 <k 0 ) satisfying the following conditions (see Figure 2(a)): Definition 5 (Recursive pseudoknot [8]). If internal unpaired regions of a simple pseudoknot (e.g., subsequences u, v and w in Figure 2(a)) fold into pseudoknotfree structure and/or simple pseudoknot, the structure is called a recursive pseudoknot (see Figure 2(b)).
In this paper, we propose an integer programming-based method that can predict a subclass of recursive pseudoknots. It should be noted that our formulation allows unpaired regions of a simple pseudoknot to fold into pseudoknot-free structure only.

Integer programming-based model Definitions of integer programming
Integer programming (IP) is an extension of linear programming. A linear programming (LP) problem is one of the optimization problems, which optimizes a linear function subject to linear equality and/or inequality con-straints. An LP problem is composed of decision variables whose values are to be decided in some optimal fashion, an objective function to be maximized or minimized, and a set of constraints. The constraints consist of linear equalities and/or inequalities with respect to the decision variables. When the decision variables are required to be integer, the problem is called an integer programming (IP) problem. In general, an IP problem can be written as follows: where a ij , b i , c j  ‫ޒ‬ (i = 1, 2, ..., m; j = 1, 2, ..., n) and x j (j = 1, 2, ..., n) denotes a decision variable defined over a set of nonnegative integers + . Note that the maximization problem is equivalent to the minimization problem where the sign of the objective function is inverted.

Formulation for RNA pseudoknot prediction
We will formulate the RNA pseudoknotted structure prediction problem as an IP problem. Specifically, we define the decision variables, the objective function and the linear constraints with respect to the decision variables.
We first define the following decision variables: for i, j = 1, 2, ..., n. The difference between x ij and y ij is that x ij = 1 corresponds to an arc that connects two bases drawn above the sequence, while y ij = 1 represents an arc drawn below the sequence (see Figure 3(a)).
In order to incorporate stacking energy into decision variables, we use 6 × 6 (row × column, resp.) energy parameter matrix E = (e kl ) (k, l = 1, 2, ..., 6) shown in Table 1 . This matrix provides stacking energy parameters for RNA folding at 37°C given by Mfold version 3.0 [4]. Note that the value of k denotes the "type" of six valid pairs. For example, k = 1 indicates A-U pair and we call it "type 1." This statement also holds for the value of l.
We then define variables for representing the stacking pair of (s i , s j ) and (s i+1 , s j-1 ) as follows: for i, j = 1, 2, ..., n and k, l = 1, 2, ..., 6 (see Figure 3(b)). Note that the above definitions include illegal variables x i+1,0 and x n+1,j-1 for each i and j respectively (which also applies to y ij ), though we allow this notation for simplicity. applies to y ij , defined by and .
We will use variables and to represent a recursive pseudoknotted structure. We let = 1 if and only if for a base pair (s i , s j ) below the sequence, there is at least one base pair above the sequence (s i' , s j' ) such that i' <i <j'. Similarly, let = 1 if and only if for a pair (s i , s j ) below the sequence, there is at least one base pair above the sequence (s i' , s j' ) such that i' <j <j' (see Figure 4(d)). With these variables, we can formulate an IP problem for RNA pseudoknot prediction as follows: ((s i ,s j ) is type k and (s i+1 ,s j-1 ) is type l), x ij + y ij  1, t th type otherwise l), . z x kl ij y ij + y i'j'  1 (i <i' <j <j'), where i, j, i', j' = 1, 2, ..., n; k, l = 1, 2, ..., 6.
The first term in summations of the objective function is the sum of the overall energy of stacking base pairs formed above the sequence, while the second term is the sum of the overall energy of the stacking base pairs formed below the sequence. We multiply the second term by a positive weighting parameter . The weighting parameter restricts the occurrence of base pairs below the sequence. This is to take into account that pseudoknotted structure frequently does not appear in nature. The value of  suggested in [5] is  < 1, though the early work employed a dynamic programming-based approach. In our model, we define   {0.55, 0.6, 0.65, 0.7, 0.75, 0.8}.
Constraints (1) and (2) mean that if (s i , s j ) is a valid pair with type k and (s i+1 , s j-1 ) is a valid pair with type l, the energy parameter associated with (k, l) stacking type will contribute to the total energy of the structure ( Figure  3(b)). Constraints (3) and (4) say that each base can be paired with at most one base regardless of whether the pair is formed above or below the sequence (Figure 4(a)). Constraints (5) and (6) inhibit crossing pairs both above the sequence and below the sequence (Figure 4(b)). In constraints (7), (8), (9) and (10), , , and are defined respectively. Constraints (11), (12), (13) and (14) guarantee that if a base is paired with another one, its adjacent base must also form a base pair (Figure 4(c)). The purpose of these constraints is to promote stacking pairs because they are known to help to stabilize the structure.
Constraints (15) and (16) (18) is used for prohibiting bifurcation structure from occurring below the sequence (Figure 4(e)). The constraint (19) guarantees all variables to be either 0 or 1.

Data set
Test sequences that are known to form pseudoknots were taken from PseudoBase [2], where we selected 34 sequences of several kinds of secondary structures and of various lengths (21-137 bases) from different families (viral 3'UTR, mRNA, rRNA, ribozymes and tRNA-like). Specifically, we first selected a set of sequences of lengths 20-140 bases uniformly. We then checked the secondary } ∈ 0 1 structure of each sequence and removed sequences of similar secondary structure from the set.
Other test sequences that do not contain any pseudoknots were obtained from Rfam [20]. Seven pseudoknot-free sequences from different families were selected in the same way as the sequences that contain pseudoknots, so that their secondary structures and lengths are different (26-88 bases).

Implementation
After formulating the IP problem for prediction of RNA secondary structure, we employed the optimizer software called ILOG CPLEX 10.1 [19] to solve the IP formulation. CPLEX is a commercial software that can solve mathematical optimization problems, including IP problems. We implemented the IP formulation by C++ and included the C++ optimization library provided by CPLEX on a machine with Intel Xeon CPU 5160 3.00 GHz and 8.00 GB RAM.
In our implementation, we reduced the set of variables x ij and y ij to solve the problem fast. During implementation of x ij and y ij , we focused on the fact that at least two base pairs are likely to appear consecutively on the sequence. We considered not only the pair of (s i , s j ) but also its adjacent pairs, i.e., (s i-1 , s j+1 ) and (s i+1 , s j-1 ). If (s i , s j ) and at least one of those adjacent pairs are valid, we implemented x ij and y ij as decision variables.

Tests
Prediction accuracy of our method was measured by the sensitivity and specificity. Specificity is defined as the proportion of the number of correctly predicted base pairs to the number of base pairs of the known structure. Specificity is defined as the proportion of the number of correctly predicted base pairs to the total number of base pairs predicted by the algorithm.
We carried out pseudoknotted structure prediction with each value of  in order to determine the most proper value (see Table 2 . We then chose  that yielded the highest average prediction accuracy on our data set. From Table 2 , the best value of  is 0.70. However, increasing or decreasing value of  did not show any definitive way to find the optimal value of .

Illustration of several constraints
We compared the prediction results on the best  ( = 0.70) of IP with the results from ILM [7], pknotsRG [6] and PKNOTS [5], shown in Table 3 . Note that PKB number is an identification number of each RNA sequence used in PseudoBase. Moreover, we compared computation time of each method (see Table 4 ). As the table shows, computation time of the IP-based method depends on the sequence length. Specifically, as the sequence length elongates, the computation time increases exponentially.
We also tested the IP-based model with the pseudoknotfree sequences. The IP-based model with  = 0.7 gives 63.30% of average sensitivity and 47.89% of average specificity. Since the input data set is a set of pseudoknot-free sequences, the appropriate value of  should be close to 0. Hence, we tested the model with   {0.05, 0.1, 0.15, 0.2}. The results showed that  = 0.1 gives the best average sensitivity (72.18%) and the best average specificity (55.93%).

Discussion
We averaged the sensitivity and specificity to examine the overall performance of each prediction method. Our IPbased method gives 75.91% sensitivity, which is the highest of the four models. For specificity, the IP-based method gives 65.40%, which is the third highest, and the best specificity is given by PKNOTS (75.09%). From Table  3 , it is obvious that for short sequences (less than 70 bases), the accuracy of IP is very high and at least comparable to the other algorithms. For sequences of lengths 70-116 bases, ILM yields the highest sensitivity and specificity for most of the sequences in this group. One reason might be that some statistical information that ILM uses plays a key role in achieving good accuracy for long sequences. It is a challenge to incorporate some kind of statistical information into our IP-based method.
The reason why the IP-based method does not yield high specificity could be that when IP is being optimized, x ij and y ij are assigned to be 1 as many as possible because the energy parameters are negative values and the objective function is to minimize the overall energy. This results in the occurrence of a number of false positive pairs. Therefore, the specificity of our method is lower than that of the other methods.
When we observed predicted structures, we found that IP always outputs pseudoknotted structure. Although ILM and PKNOTS provide good specificity, it is not guaranteed that base pairs forming pseudoknots are always predicted, especially for short sequences (see Figure 5). Since the proportion of base pairs that constitute pseudoknots in RNA secondary structure is small compared to the total number of base pairs, the specificity of those algorithms is high.
We considered using the leave-one-out strategy to verify the optimality of , resulting in a drop in accuracy (70.19% average sensitivity and 60.33% average specificity). Among 34 leave-one-out experiments, there are 28 experiments where  = 0.70 yields the best prediction result. However, there is one experiment that gives low accuracy (11.11% sensitivity and 8.70% specificity), where PKB147 is used as a test sequence and  = 0.75. As a result, the average sensitivity and specificity is lower than the results shown in Table 3 where we do not perform the cross validation for . As stated before,  is a fixed weighting parameter to restrict the occurrence of base pairs that form pseudoknots. It is not necessary to train the value of  because our main purpose is not to optimize the  parameter but to test the applicability of the IP-based model to the secondary structure prediction problem. If we aim at training parameters of the prediction algorithm, we should consider not only  but also the other coefficients of the objective function (i.e., energy parameters), which is left as future work.
As explained in Implementation subsection, we implemented decision variables x ij and y ij based on two valid consecutive base pairs. It should be noted that the number of these variables is fewer than the number of original variables before the reduction (seeTable 5 ). Such a reduction contributes to lower memory usage, which leads to the capability of dealing with long sequences for prediction. Furthermore, we also considered three valid consecutive pairs, i.e., (s i-2 , s j+2 ), (s i-1 , s j+1 ), (s i+1 , s j-1 ), and (s i+2 , s j-2 ) for each pair of (s i , s j ) so that we can further reduce the number of variables and can expect to increase the prediction accuracy. However, the performance using two consecutive pairs was better than using three consecutive pairs. This might reveal that reducing too many variables could make the problem harder to optimize, which results in worse prediction accuracy and more computation time. Note that in this paper we only show the results using variables on two consecutive pairs.
Since actual RNA structure does not necessarily have the lowest energy, it is important to consider suboptimal structures, which might improve prediction accuracy. From this viewpoint, Zuker [21] proposed an efficient suboptimal folding algorithm. In our case, the optimal solution depends on how the solver (CPLEX) works out the IP problem, which is hidden to us, and thus we cannot retrieve suboptimal solutions from the solver. However, once an optimal solution (structure) is determined by the solver, we might be able to calculate some of the subopti-mal structures by describing constraints where some stacking pairs are forbidden.

Conclusion
We proposed an integer programming (IP)-based method of predicting RNA secondary structure with a certain kind of recursive pseudoknot. Prediction tests on a set of RNA sequences were carried out, which showed good performance in accuracy for a data set of relatively small length. Furthermore, our method achieved higher average sensitivity than that of several existing prediction methods for the same test set, as well as guaranteed that base pairs involved in pseudoknots are always predicted. We also tested the IP-based model with pseudoknot-free sequences. Although this IP formulation is specifically designed for pseudoknotted structure, the results showed that the IP-based method can also be useful in predicting secondary structure in the absence of pseudoknots. However, based on this IP formulation, we can formulate another IP-based model to predict pseudoknot-free structure and would be able to obtain better prediction accuracy.
As described before, our IP-based approach is flexible and extensible. Recall that the IP-based prediction method takes much time for long RNA sequences. In general, computation time of IP is exponential to the number of variables and thus it is important to reduce the number of variables. Although in this paper we used an IP variable (e.g., x ij ) to represent just one base pair, we would be able to define variables in such a way that they can handle larger units at a time, which results in further reduction of the number of variables. A kind of divide-and-conquer approach could also be useful where a long input sequence is divided into several subsequences of short length and then we apply the IP-based prediction method to each short subsequence. This approach will shorten the computation time, as well as increase the prediction accuracy, which is left as our future work. Comparison of predicted structure for PKB124 between the four algorithms Figure 5 Comparison of predicted structure for PKB124 between the four algorithms.