- Research
- Open Access
Prediction of RNA secondary structure with pseudoknots using integer programming
- Unyanee Poolsap^{1},
- Yuki Kato^{1}Email author and
- Tatsuya Akutsu^{1}
https://doi.org/10.1186/1471-2105-10-S1-S38
© Poolsap et al; licensee BioMed Central Ltd. 2009
- Published: 30 January 2009
Abstract
Background
RNA secondary structure prediction is one major task in bioinformatics, and various computational methods have been proposed so far. Pseudoknot is one of the typical substructures appearing in several RNAs, and plays an important role in some biological processes. Prediction of RNA secondary structure with pseudoknots is still challenging since the problem is NP-hard when arbitrary pseudoknots are taken into consideration.
Results
We introduce a new method of predicting RNA secondary structure with pseudoknots based on integer programming. In our formulation, we aim at minimizing the value of the objective function that reflects free energy of a folding structure of an input RNA sequence. We focus on a practical class of pseudoknots by setting constraints appropriately. Experimental results for a set of real RNA sequences show that our proposed method outperforms several existing methods in sensitivity. Furthermore, for a set of sequences of small length, our approach achieved good performance in both sensitivity and specificity.
Conclusion
Our integer programming-based approach for RNA structure prediction is flexible and extensible.
Keywords
- Integer Programming
- Integer Programming Problem
- Integer Programming Formulation
- Pseudoknotted Structure
- Grammatical Approach
Background
Many single-stranded RNAs are considered to fold back on themselves to be thermodynamically stable. This idea has led to the development of several algorithms that minimize the equilibrium free energy of RNA. In the pioneering work, we refer to the Zuker's algorithm [3] that predicts secondary structure with the lowest free energy by using a dynamic programming (DP) technique, and its implementation named Mfold [4] is one of the major softwares for RNA secondary structure prediction. The time complexity of the Zuker's algorithm is O(n^{3}) where n is the length of an input RNA sequence. However, this algorithm cannot deal with pseudoknots. To predict pseudoknotted structure, several DP algorithms that minimize free energy in the range from O(n^{4}) to O(n^{6}) time were proposed including PKNOTS [5], pknotsRG [6] and iterated loop matching (ILM) [7]. These algorithms focus on some restricted pseudoknots for solvability in polynomial time since prediction of arbitrary pseudoknotted structure was proven to be NP-hard [8, 9].
There have also been several grammatical approaches to modeling some kinds of pseudoknots, including use of tree adjoining grammar [10, 11], crossed-interaction grammar [12], parallel communicating grammar [13] and multiple context-free grammar [14]. In these grammatical approaches, secondary structure prediction can be interpreted as parsing of the grammars, which can be addressed in O(n^{4}) to O(n^{6}) time. Although performance of these grammar-based methods is comparable to that of energy-based methods, grammatical approaches have the advantage of the capability of constructing profile (or modeling consensus structure) when multiple structural alignment is given in advance.
The sequence alignment information is also useful for RNA secondary structure prediction. Sankoff [15] proposed a DP algorithm that can simultaneously solve the sequence alignment and folding problems for multiple sequences. Later, Sankoff's algorithm was improved by Mathews and Turner [16], where their proposed algorithm is called Dynalign. Dynalign combines free energy minimization and comparative sequence analysis to find a low free energy structure common to two sequences without requiring any sequence identity.
Recently, Parisien and Major [17] presented a new analysis of RNA secondary structure prediction by incorporating all base pairs including noncanonical ones. They introduced a new representation of the nucleotide relationships in structural RNAs, called the nucleotide cyclic motif (NCM). NCM is incorporated into the scoring function of a folding algorithm, which is also based on DP.
Returning to the viewpoint of energy minimization, RNA secondary structure prediction can be regarded as a kind of optimization problem. In fact, several problems in bioinformatics can be formulated as combinatorial optimization problems. One of the successful applications is RAPTOR [18], which calculates protein threading using an integer programming (IP) formulation. Although IP problems are known to be NP-hard, descriptive power of IP is strong and flexible. In addition, recent commercial optimization softwares can deal with relatively large-scale instances even if a problem is computationally hard to solve.
In this paper, we present a new method of predicting RNA secondary structure with pseudoknots based on IP. In our model, thermodynamic information is incorporated into the objective function whose value is to be minimized, and structural information is represented by constraints. It should be noted that we focus on a practical class of pseudoknots by setting constraints appropriately. We use the CPLEX software [19] to solve the IP problem and evaluate our method on a set of RNA sequences contained in PseudoBase [2] and Rfam [20]. Advantages of our proposed method are summarized as follows:
• Our prediction method using IP is flexible. Specifically, various types of secondary structures can be handled by adding or removing constraints. In fact, we can describe pseudoknot-free structures as well as generalized planar pseudoknots (see [8] for definition) by IP formulations, and in this paper we provide a modeling of a certain type of recursive pseudoknot (see Methods).
• The IP-based method outperforms three existing methods PKNOTS, pknotsRG and ILM in sensitivity for 34 RNA sequences known to have pseudoknots. In particular, for a set of sequences of small length (≤ 46), our approach achieved good performance in both sensitivity and specificity.
• Once we can model the prediction problem by an IP formulation, we need not develop any algorithm from scratch because high-performance solvers for IP are now available. Thus, we can focus on how to describe the topology of structure and how to assign appropriate scores to the model.
• The IP-based approach is extensible. If we have incomplete data on secondary structure where parts of the structure (i.e., a set of base pairs) have been determined by experiment, we can complement the incomplete structure by modeling the known parts of the structure as constraints and solving the IP problem.
Methods
Definitions of RNA secondary structure and pseudoknot
In this section we will describe the preliminary definitions of RNA secondary structure including pseudoknots.
Definition 1 (RNA secondary structure). An RNA sequence is represented by a string of n characters s = s_{1}s_{2} ⋯ s_{ n }where s_{ i }∈ {A, C, G, U}. A secondary structure of the sequence s is defined as a set S of base pairs (s_{ i }, s_{ j }) such that the following conditions are satisfied:
1. 1 ≤ i <j ≤ n, meaning, two bases that form a pair must be located at different positions.
2. j - i > t where t is a small positive constant, meaning, the sequence does not fold too sharply on itself.
3. For all base pairs (s_{ i }, s_{ j }) and (s_{ i }', s_{ j }') in S, i = i' if and only if j = j', meaning, each base can be paired with at most one base.
Here, we allow only Watson-Crick base pairs (A, U) and (C, G), and a wobble base pair (G, U) to form the secondary structure, which we call valid base pairs.
Definition 2 (Pseudoknot). An RNA secondary structure S is said to contain a pseudoknot if and only if there exist (s_{ i }, s_{ j }), (s'_{ i }, s'_{ j }) ∈ S (i <i') such that i <i' <j <j'.
- 1.
i <j <i' <j', i.e., (s_{ i }, s_{ j }) precedes (s_{ i' }, s_{ j' }), or
- 2.
i <i' <j' <j, i.e., (s_{ i }, s_{ j }) includes (s_{ i' }, s_{ j' }).
There are various kinds of pseudoknots, depending on how arcs representing base pairs cross above the sequence. For tractability in computational complexity, several classes of pseudoknots were proposed.
- 1.
Each (i, j) ∈ ${{S}^{\prime}}_{{i}_{0},{k}_{0}}$ satisfies either i_{0} ≤ i <${{j}^{\prime}}_{0}$ ≤ j <j_{0} or ${{j}^{\prime}}_{0}$ ≤ i <j_{0} ≤ j ≤ k_{0}.
- 2.
There exists (i, j) ∈ ${{S}^{\prime}}_{{i}_{0},{k}_{0}}$ satisfying i_{0} ≤ i <${{j}^{\prime}}_{0}$ ≤ j <j_{0}.
- 3.
There exists (i, j) ∈ ${{S}^{\prime}}_{{i}_{0},{k}_{0}}$ satisfying ${{j}^{\prime}}_{0}$ ≤ i <j_{0} ≤ j ≤ k_{0}.
- 4.
If pairs (i, j) and (i', j') in ${{S}^{\prime}}_{{i}_{0},{k}_{0}}$ satisfy either i <i' <${{j}^{\prime}}_{0}$ or ${{j}^{\prime}}_{0}$ ≤ i < i', then j > j' holds.
Definition 5 (Recursive pseudoknot [8]). If internal unpaired regions of a simple pseudoknot (e.g., subsequences u, v and w in Figure 2(a)) fold into pseudoknot-free structure and/or simple pseudoknot, the structure is called a recursive pseudoknot (see Figure 2(b)).
In this paper, we propose an integer programming-based method that can predict a subclass of recursive pseudoknots. It should be noted that our formulation allows unpaired regions of a simple pseudoknot to fold into pseudoknot-free structure only.
Integer programming-based model
Definitions of integer programming
where a_{ ij }, b_{ i }, c_{ j }∈ ℝ (i = 1, 2, ..., m; j = 1, 2, ..., n) and x_{ j }(j = 1, 2, ..., n) denotes a decision variable defined over a set of nonnegative integers ℤ^{+}. Note that the maximization problem is equivalent to the minimization problem where the sign of the objective function is inverted.
Formulation for RNA pseudoknot prediction
We will formulate the RNA pseudoknotted structure prediction problem as an IP problem. Specifically, we define the decision variables, the objective function and the linear constraints with respect to the decision variables.
Stacking energy parameter matrix E [4].
A-U | C-G | G-C | G-U | U-G | U-A | |
---|---|---|---|---|---|---|
A-U | -1.1 | -2.1 | -2.2 | -1.4 | -0.9 | -0.6 |
C-G | -2.1 | -2.4 | -3.3 | -2.1 | -2.1 | -1.4 |
G-C | -2.2 | -3.3 | -3.4 | -2.5 | -2.4 | -1.5 |
G-U | -1.4 | -2.1 | -2.5 | -1.3 | -1.3 | -0.5 |
U-G | -0.9 | -2.1 | -2.4 | -1.3 | -1.3 | -1.0 |
U-A | -0.6 | -1.4 | -1.5 | -0.5 | -1.0 | -0.3 |
for i, j = 1, 2, ..., n and k, l = 1, 2, ..., 6 (see Figure 3(b)). Note that the above definitions include illegal variables x_{i+1,0}and x_{n+1,j-1}for each i and j respectively (which also applies to y_{ ij }), though we allow this notation for simplicity.
Let ${L}_{{x}_{i}}$ = 1 and ${R}_{{x}_{i}}$ = 1 if and only if the base s_{ i }pairs with some base at any other position greater than i and less than i respectively. In other words, ${L}_{{x}_{i}}$ = 1 means that s_{ i }is on the left side of a base pair and ${R}_{{x}_{i}}$ = 1 means that s_{ i }is on the right side of a base pair. This set of variables also applies to y_{ ij }, defined by ${L}_{{y}_{i}}$ and ${R}_{{y}_{i}}$.
((s_{ i },s_{ j }) is type k and (s_{i+1},s_{j-1}) is type l),
((s_{ i },s_{ j }) is type k and (s_{i+1},s_{j-1}) is type l),
x_{ ij }+ y_{ ij }≤ 1,
x_{ ij }+ x_{i'j'}≤ 1 (∀i <i' <j <j'),
y_{ ij }+ y_{i'j'}≤ 1 (∀i <i' <j <j'),
y_{ ij }+ y_{i'j'}≤ 1 (∀i <j <i' <j'),
The first term in summations of the objective function is the sum of the overall energy of stacking base pairs formed above the sequence, while the second term is the sum of the overall energy of the stacking base pairs formed below the sequence. We multiply the second term by a positive weighting parameter σ. The weighting parameter restricts the occurrence of base pairs below the sequence. This is to take into account that pseudoknotted structure frequently does not appear in nature. The value of σ suggested in [5] is σ < 1, though the early work employed a dynamic programming-based approach. In our model, we define σ ∈ {0.55, 0.6, 0.65, 0.7, 0.75, 0.8}.
Constraints (1) and (2) mean that if (s_{ i }, s_{ j }) is a valid pair with type k and (s_{i+1}, s_{j-1}) is a valid pair with type l, the energy parameter associated with (k, l) stacking type will contribute to the total energy of the structure (Figure 3(b)). Constraints (3) and (4) say that each base can be paired with at most one base regardless of whether the pair is formed above or below the sequence (Figure 4(a)). Constraints (5) and (6) inhibit crossing pairs both above the sequence and below the sequence (Figure 4(b)). In constraints (7), (8), (9) and (10), ${L}_{{x}_{i}}$, ${R}_{{x}_{i}}$, ${L}_{{y}_{i}}$ and ${R}_{{y}_{i}}$ are defined respectively. Constraints (11), (12), (13) and (14) guarantee that if a base is paired with another one, its adjacent base must also form a base pair (Figure 4(c)). The purpose of these constraints is to promote stacking pairs because they are known to help to stabilize the structure. Constraints (15) and (16) define ${L}_{{t}_{ij}}$ and ${R}_{{t}_{ij}}$ respectively. In the constraint (17), if (s_{ i }, s_{ j }) is a base pair formed below the sequence, either ${L}_{{t}_{ij}}$ or ${R}_{{t}_{ij}}$ can take the value 1. This means that the crossing region can occur once at a time (Figure 4(d)). Notice that base pairs can be formed recursively in the loop regions, for example, the region from i' to i and the region from i to j' in Figure 4(d). By virtue of constraints (15), (16), (17), this IP model can handle a subclass of recursive pseudoknots where internal unpaired regions of a simple pseudoknot can have pseudoknot-free structure. The constraint (18) is used for prohibiting bifurcation structure from occurring below the sequence (Figure 4(e)). The constraint (19) guarantees all variables to be either 0 or 1.
Results and discussion
Data set
Test sequences that are known to form pseudoknots were taken from PseudoBase [2], where we selected 34 sequences of several kinds of secondary structures and of various lengths (21–137 bases) from different families (viral 3'UTR, mRNA, rRNA, ribozymes and tRNA-like). Specifically, we first selected a set of sequences of lengths 20–140 bases uniformly. We then checked the secondary structure of each sequence and removed sequences of similar secondary structure from the set.
Other test sequences that do not contain any pseudoknots were obtained from Rfam [20]. Seven pseudoknot-free sequences from different families were selected in the same way as the sequences that contain pseudoknots, so that their secondary structures and lengths are different (26–88 bases).
Implementation
After formulating the IP problem for prediction of RNA secondary structure, we employed the optimizer software called ILOG CPLEX 10.1 [19] to solve the IP formulation. CPLEX is a commercial software that can solve mathematical optimization problems, including IP problems. We implemented the IP formulation by C++ and included the C++ optimization library provided by CPLEX on a machine with Intel Xeon CPU 5160 3.00 GHz and 8.00 GB RAM.
In our implementation, we reduced the set of variables x_{ ij }and y_{ ij }to solve the problem fast. During implementation of x_{ ij }and y_{ ij }, we focused on the fact that at least two base pairs are likely to appear consecutively on the sequence. We considered not only the pair of (s_{ i }, s_{ j }) but also its adjacent pairs, i.e., (s_{i-1}, s_{j+1}) and (s_{i+1}, s_{j-1}). If (s_{ i }, s_{ j }) and at least one of those adjacent pairs are valid, we implemented x_{ ij }and y_{ ij }as decision variables.
Tests
Prediction accuracy of our method was measured by the sensitivity and specificity. Specificity is defined as the proportion of the number of correctly predicted base pairs to the number of base pairs of the known structure. Specificity is defined as the proportion of the number of correctly predicted base pairs to the total number of base pairs predicted by the algorithm.
Average sensitivity and specificity of each value of σ.
σ | Avg. sensitivity (%) | Avg. specificity (%) |
---|---|---|
0.55 | 75.67 | 64.92 |
0.60 | 72.70 | 62.21 |
0.65 | 70.53 | 60.86 |
0.70 | 75.91 | 65.40 |
0.75 | 75.09 | 65.02 |
0.80 | 70.53 | 60.89 |
Prediction results of the IP-based method (σ = 0.7) and prediction results of the other algorithms.
PKB num. | Length | Sensitivity (%) | Specificity (%) | ||||||
---|---|---|---|---|---|---|---|---|---|
IP | ILM | pknotsRG | PKNOTS | IP | ILM | pknotsRG | PKNOTS | ||
PKB115 | 21 | 100.00 | 66.67 | 100.00 | 66.67 | 100.00 | 100.00 | 100.00 | 100.00 |
PKB102 | 24 | 100.00 | 87.50 | 100.00 | 87.50 | 80.00 | 77.78 | 88.89 | 77.78 |
PKB119 | 24 | 88.89 | 0.00 | 77.78 | 66.67 | 88.89 | 0.00 | 87.50 | 100.00 |
PKB103 | 25 | 100.00 | 57.14 | 57.14 | 42.86 | 70.00 | 66.67 | 50.00 | 50.00 |
PKB123 | 26 | 90.00 | 90.00 | 70.00 | 60.00 | 90.00 | 90.00 | 77.78 | 85.71 |
PKB154 | 26 | 100.00 | 90.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
PKB152 | 26 | 100.00 | 90.00 | 100.00 | 100.00 | 90.91 | 90.00 | 90.91 | 90.91 |
PKB126 | 27 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
PKB124 | 29 | 100.00 | 70.00 | 100.00 | 70.00 | 100.00 | 100.00 | 100.00 | 100.00 |
PKB100 | 31 | 100.00 | 91.67 | 91.67 | 100.00 | 92.31 | 91.67 | 91.67 | 100.00 |
PKB105 | 32 | 88.89 | 88.89 | 88.89 | 88.89 | 88.89 | 100.00 | 100.00 | 100.00 |
PKB118 | 33 | 90.00 | 80.00 | 90.00 | 80.00 | 81.82 | 72.73 | 81.82 | 72.73 |
PKB120 | 36 | 100.00 | 100.00 | 100.00 | 100.00 | 85.71 | 100.00 | 100.00 | 100.00 |
PKB65 | 46 | 93.33 | 93.33 | 0.00 | 73.33 | 70.00 | 87.50 | 0.00 | 68.75 |
PKB205 | 48 | 23.08 | 0.00 | 23.08 | 23.08 | 15.00 | 0.00 | 23.08 | 23.08 |
PKB147 | 51 | 77.78 | 55.56 | 50.00 | 50.00 | 63.64 | 55.56 | 47.37 | 52.94 |
PKB248 | 66 | 80.00 | 65.00 | 35.00 | 65.00 | 64.00 | 72.22 | 33.33 | 61.90 |
PKB72 | 67 | 100.00 | 0.00 | 35.29 | 100.00 | 62.96 | 0.00 | 31.58 | 73.91 |
PKB140 | 69 | 73.91 | 65.22 | 13.04 | 43.48 | 62.96 | 65.22 | 15.79 | 50.00 |
PKB143 | 71 | 62.50 | 91.67 | 29.17 | 70.83 | 48.39 | 78.57 | 29.17 | 68.00 |
PKB144 | 71 | 79.17 | 95.83 | 29.17 | 66.67 | 70.37 | 95.83 | 31.82 | 72.73 |
PKB173 | 73 | 63.64 | 77.27 | 36.36 | 77.27 | 50.00 | 68.00 | 44.44 | 77.27 |
PKB276 | 73 | 42.86 | 76.19 | 33.33 | 38.10 | 30.00 | 80.00 | 30.43 | 38.10 |
PKB275 | 85 | 65.38 | 96.15 | 15.38 | 50.00 | 50.00 | 83.33 | 15.38 | 72.22 |
PKB75 | 88 | 71.88 | 93.75 | 18.75 | 81.25 | 62.16 | 85.71 | 22.22 | 81.25 |
PKB76 | 89 | 56.00 | 100.00 | 52.00 | 44.00 | 40.00 | 69.44 | 48.15 | 34.38 |
PKB164 | 96 | 38.71 | 74.19 | 48.39 | 35.48 | 29.27 | 65.71 | 53.57 | 34.38 |
PKB168 | 105 | 61.76 | 79.41 | 35.29 | 76.47 | 50.00 | 81.82 | 34.29 | 86.67 |
PKB252 | 110 | 15.38 | 97.44 | 35.90 | 97.44 | 13.04 | 80.85 | 35.90 | 92.68 |
PKB191 | 113 | 74.36 | 82.05 | 56.41 | 71.79 | 61.70 | 80.00 | 55.00 | 73.68 |
PKB135 | 116 | 82.05 | 84.62 | 38.46 | 74.36 | 69.57 | 84.62 | 40.54 | 74.36 |
PKB236 | 120 | 50.00 | 37.50 | 0.00 | 67.50 | 42.55 | 30.61 | 0.00 | 72.97 |
PKB137 | 133 | 70.45 | 84.09 | 63.64 | 86.36 | 65.96 | 80.43 | 63.64 | 86.36 |
PKB134 | 137 | 40.91 | 59.09 | 40.91 | 84.09 | 33.33 | 50.00 | 41.86 | 80.43 |
Average | 75.91 | 74.12 | 54.85 | 71.74 | 65.40 | 73.07 | 54.89 | 75.09 |
Computation time of the IP-based method (σ = 0.7) and the other algorithms. Note that it is difficult to measure the exact computation time of pknotsRG since it is provided as an interactive system, and the time listed below is a rough estimate when running pknotsRG.
Seq. length | Avg. computation time (sec.) | |||
---|---|---|---|---|
IP | ILM | pknotsRG | PKNOTS | |
21–46 | 0.14 | 0.05 | <1.00 | 0.73 |
48–66 | 16.40 | 0.15 | <1.00 | 24.67 |
67–137 | 6641.53 | 0.31 | <1.00 | 1205.61 |
We also tested the IP-based model with the pseudoknot-free sequences. The IP-based model with σ = 0.7 gives 63.30% of average sensitivity and 47.89% of average specificity. Since the input data set is a set of pseudoknot-free sequences, the appropriate value of σ should be close to 0. Hence, we tested the model with σ ∈ {0.05, 0.1, 0.15, 0.2}. The results showed that σ = 0.1 gives the best average sensitivity (72.18%) and the best average specificity (55.93%).
Discussion
We averaged the sensitivity and specificity to examine the overall performance of each prediction method. Our IP-based method gives 75.91% sensitivity, which is the highest of the four models. For specificity, the IP-based method gives 65.40%, which is the third highest, and the best specificity is given by PKNOTS (75.09%). From Table 3 , it is obvious that for short sequences (less than 70 bases), the accuracy of IP is very high and at least comparable to the other algorithms. For sequences of lengths 70–116 bases, ILM yields the highest sensitivity and specificity for most of the sequences in this group. One reason might be that some statistical information that ILM uses plays a key role in achieving good accuracy for long sequences. It is a challenge to incorporate some kind of statistical information into our IP-based method.
The reason why the IP-based method does not yield high specificity could be that when IP is being optimized, x_{ ij }and y_{ ij }are assigned to be 1 as many as possible because the energy parameters are negative values and the objective function is to minimize the overall energy. This results in the occurrence of a number of false positive pairs. Therefore, the specificity of our method is lower than that of the other methods.
We considered using the leave-one-out strategy to verify the optimality of σ, resulting in a drop in accuracy (70.19% average sensitivity and 60.33% average specificity). Among 34 leave-one-out experiments, there are 28 experiments where σ = 0.70 yields the best prediction result. However, there is one experiment that gives low accuracy (11.11% sensitivity and 8.70% specificity), where PKB147 is used as a test sequence and σ = 0.75. As a result, the average sensitivity and specificity is lower than the results shown in Table 3 where we do not perform the cross validation for σ. As stated before, σ is a fixed weighting parameter to restrict the occurrence of base pairs that form pseudoknots. It is not necessary to train the value of σ because our main purpose is not to optimize the σ parameter but to test the applicability of the IP-based model to the secondary structure prediction problem. If we aim at training parameters of the prediction algorithm, we should consider not only σ but also the other coefficients of the objective function (i.e., energy parameters), which is left as future work.
Number of variables before and after implementing variable reduction.
Seq. length | 21 | 137 |
---|---|---|
Before reduction | 12244 | 400984 |
After reduction | 2938 | 117240 |
Since actual RNA structure does not necessarily have the lowest energy, it is important to consider suboptimal structures, which might improve prediction accuracy. From this viewpoint, Zuker [21] proposed an efficient suboptimal folding algorithm. In our case, the optimal solution depends on how the solver (CPLEX) works out the IP problem, which is hidden to us, and thus we cannot retrieve suboptimal solutions from the solver. However, once an optimal solution (structure) is determined by the solver, we might be able to calculate some of the suboptimal structures by describing constraints where some stacking pairs are forbidden.
Conclusion
We proposed an integer programming (IP)-based method of predicting RNA secondary structure with a certain kind of recursive pseudoknot. Prediction tests on a set of RNA sequences were carried out, which showed good performance in accuracy for a data set of relatively small length. Furthermore, our method achieved higher average sensitivity than that of several existing prediction methods for the same test set, as well as guaranteed that base pairs involved in pseudoknots are always predicted.
We also tested the IP-based model with pseudoknot-free sequences. Although this IP formulation is specifically designed for pseudoknotted structure, the results showed that the IP-based method can also be useful in predicting secondary structure in the absence of pseudoknots. However, based on this IP formulation, we can formulate another IP-based model to predict pseudoknot-free structure and would be able to obtain better prediction accuracy.
As described before, our IP-based approach is flexible and extensible. Recall that the IP-based prediction method takes much time for long RNA sequences. In general, computation time of IP is exponential to the number of variables and thus it is important to reduce the number of variables. Although in this paper we used an IP variable (e.g., x_{ ij }) to represent just one base pair, we would be able to define variables in such a way that they can handle larger units at a time, which results in further reduction of the number of variables. A kind of divide-and-conquer approach could also be useful where a long input sequence is divided into several subsequences of short length and then we apply the IP-based prediction method to each short subsequence. This approach will shorten the computation time, as well as increase the prediction accuracy, which is left as our future work.
Declarations
Acknowledgements
The authors gratefully acknowledge helpful discussions with Professor Hiroyuki Seki of Nara Institute of Science and Technology and Professor Kiyoshi Asai of the University of Tokyo on several points in the paper. The first author thanks the Japanese Government (Monbukagakusho: MEXT) Scholarship from the Ministry of Education, Culture, Sports, Science and Technology.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
Authors’ Affiliations
References
- Shabalina SA, Spiridonov NA: The mammalian transcriptome and the function of non-coding DNA sequences. Genome Biology. 2004, 5 (4): 105-10.1186/gb-2004-5-4-105.PubMed CentralView ArticlePubMedGoogle Scholar
- Batenburg FHDvan, Gultyaev AP, Pleij CWA, Ng J, Oliehoek J: PseudoBase: a database with RNA pseudoknots. Nucleic Acids Research. 2000, 28: 201-204. 10.1093/nar/28.1.201.PubMed CentralView ArticlePubMedGoogle Scholar
- Zuker M, Stiegler P: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research. 1981, 9: 133-148. 10.1093/nar/9.1.133.PubMed CentralView ArticlePubMedGoogle Scholar
- Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research. 2003, 31: 3406-3415. 10.1093/nar/gkg595.PubMed CentralView ArticlePubMedGoogle Scholar
- Rivas E, Eddy SR: A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology. 1999, 285: 2053-2068. 10.1006/jmbi.1998.2436.View ArticlePubMedGoogle Scholar
- Reeder J, Giegerich R: Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics. 2004, 5:Google Scholar
- Ruan J, Stormo GD, Zhang W: An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics. 2004, 20: 58-66. 10.1093/bioinformatics/btg373.View ArticlePubMedGoogle Scholar
- Akutsu T: Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discrete Applied Mathematics. 2000, 104: 45-62. 10.1016/S0166-218X(00)00186-4.View ArticleGoogle Scholar
- Lyngsø RB, Pedersen CNS: RNA pseudoknot prediction in energy-based models. Journal of Computational Biology. 2000, 7: 409-427. 10.1089/106652700750050862.View ArticlePubMedGoogle Scholar
- Uemura Y, Hasegawa A, Kobayashi S, Yokomori T: Tree adjoining grammars for RNA structure prediction. Theoretical Computer Science. 1999, 210: 277-303. 10.1016/S0304-3975(98)00090-5.View ArticleGoogle Scholar
- Matsui H, Sato K, Sakakibara Y: Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Bioinformatics. 2005, 21: 2611-2617. 10.1093/bioinformatics/bti385.View ArticlePubMedGoogle Scholar
- Rivas E, Eddy SR: The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics. 2000, 16: 334-340. 10.1093/bioinformatics/16.4.334.View ArticlePubMedGoogle Scholar
- Cai L, Malmberg RL, Wu Y: Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics. 2003, 19 (Suppl 1): i66-i73. 10.1093/bioinformatics/btg1007.View ArticlePubMedGoogle Scholar
- Kato Y, Seki H, Kasami T: RNA pseudoknotted structure prediction using stochastic multiple context-free grammar. IPSJ Transactions on Bioinformatics. 2006, 47: 12-21.Google Scholar
- Sankoff D: Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal on Applied Mathematics. 1985, 45: 810-825. 10.1137/0145048.View ArticleGoogle Scholar
- Mathews DH, Turner DH: Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology. 2002, 317: 191-203. 10.1006/jmbi.2001.5351.View ArticlePubMedGoogle Scholar
- Parisien M, Major F: The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature. 2008, 452: 51-55. 10.1038/nature06684.View ArticlePubMedGoogle Scholar
- Xu J, Li M, Kim D, Xu Y: RAPTOR: optimal protein threading by linear programming. J Bioinform Comput Biol. 2003, 1: 95-117. 10.1142/S0219720003000186.View ArticlePubMedGoogle Scholar
- ILOG CPLEX. [http://www.ilog.com/products/cplex/]
- Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 2005, 33 (Database issue): D121-D124. 10.1093/nar/gki081.PubMed CentralView ArticlePubMedGoogle Scholar
- Zuker M: On finding all suboptimal foldings of an RNA molecule. Science. 1989, 244: 48-52. 10.1126/science.2468181.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.