 Research Article
 Open Access
Biobjective integer programming for RNA secondary structure prediction with pseudoknots
 Audrey Legendre^{1},
 Eric Angel^{1} and
 Fariza Tahi^{1}Email author
https://doi.org/10.1186/s1285901820077
© The Author(s) 2018
 Received: 4 July 2017
 Accepted: 2 January 2018
 Published: 15 January 2018
Abstract
Background
RNA structure prediction is an important field in bioinformatics, and numerous methods and tools have been proposed. Pseudoknots are specific motifs of RNA secondary structures that are difficult to predict. Almost all existing methods are based on a single model and return one solution, often missing the real structure. An alternative approach would be to combine different models and return a (small) set of solutions, maximizing its quality and diversity in order to increase the probability that it contains the real structure.
Results
We propose here an original method for predicting RNA secondary structures with pseudoknots, based on integer programming. We developed a generic biobjective integer programming algorithm allowing to return optimal and suboptimal solutions optimizing simultaneously two models. This algorithm was then applied to the combination of two known models of RNA secondary structure prediction, namely MEA and MFE. The resulting tool, called BiokoP, is compared with the other methods in the literature. The results show that the best solution (structure with the highest F_{1}score) is, in most cases, given by BiokoP. Moreover, the results of BiokoP are homogeneous, regardless of the pseudoknot type or the presence or not of pseudoknots. Indeed, the F_{1}scores are always higher than 70% for any number of solutions returned.
Conclusion
The results obtained by BiokoP show that combining the MEA and the MFE models, as well as returning several optimal and several suboptimal solutions, allow to improve the prediction of secondary structures. One perspective of our work is to combine better monocriterion models, in particular to combine a model based on the comparative approach with the MEA and the MFE models. This leads to develop in the future a new multiobjective algorithm to combine more than two models. BiokoP is available on the EvryRNA platform: https://EvryRNA.ibisc.univevry.fr.
Keywords
 RNA
 Secondary structure
 Pseudoknot
 Integer programming
 Biobjective
 Optimal solutions
 Suboptimal solutions
Background
RNAs are involved in numerous pathologies such as cancer and neurodegenerative diseases. Determining the structure of an RNA is an important step in the understanding of its biological and biochemical function, its classification and its interaction with other molecules. In this paper, we are interested in the prediction of the secondary structure of RNAs with pseudoknots. Pseudoknots can have important roles in the translation process. For example, some studies have shown that the interaction of a pseudoknot with the ribosome induces a break of the ribosome during the translation, by causing a deformation of the tRNA in the P site [1].
Predicting the secondary structure with pseudoknots of an RNA sequence is a subject which is heavily studied in the literature. In fact, this problem was proved to be NPhard for various energy models [2, 3] and, as the current provided tools are not satisfactory, it is still an open subject. Two main approaches exist for predicting RNA structures (with or without pseudoknots): the thermodynamic approach and the comparative approach. The thermodynamic approach consists in, either computing the structure of minimum free energy (MFE) according to a set of thermodynamic parameters, or computing the structure of maximum expected accuracy (MEA) with a partition function. The comparative approach consists in finding a conserved RNA structure between several species. This approach needs therefore several (homologous) sequences as inputs, unlike the first approach where only one sequence is needed.
Many tools have been proposed in the literature for predicting RNA pseudoknots. We can cite for instance tools based on MFE models [4–9], tools based on MEA models [10, 11] and tools based on the comparative approach [12, 13]. However, the results of a single given model can only approach the real structure. For example, it is now established that the real structure has a very low energy, but not necessarily the minimum one (indeed, many factors are involved, such as the environment). Approaches able to combine different models are therefore interesting. To our knowledge, very few tools have been proposed to combine different models for the prediction of secondary structures of RNAs with pseudoknots. Combination has been used for the prediction of a consensus structure of several homologous sequences, as performed in ILM [13] which combines the comparative approach with an MFE model, and in IPknot [10] which combines the comparative approach with an MEA model. An algebraic dynamic programming method [14] has also been proposed to combine the MEA and the MFE models. However, no dedicated tool is available. Moreover, very few tools, namely pKiss [4], McGenus [5] and Tfold [12], have been proposed to return several solutions of secondary structures with pseudoknots. Proposing a unique solution, the optimal one according to a given model, is restrictive, for the reasons given above. It is important to consider also suboptimal solutions. Our goal is to develop a method combining different models and returning both several optimal and several suboptimal solutions. In this paper, we are interested in the thermodynamic approach, as we consider a single RNA sequence of interest as input.
The majority of RNA secondary structure prediction tools were developed using the dynamic programming methodology [4, 5, 7, 11]. In [6] and [10], another approach was proposed: integer programming. An integer program is a mathematical formalization of a problem. It consists in an objective function to optimize on a set of integer variables, subject to a set of linear constraints. This approach is very flexible, allowing to model mathematically a large range of problems. It has been applied to various domains, from economy to industry. To our knowledge, only one team has used integer programming for RNA secondary structure prediction with pseudoknots. First they developed an integer program [6] to find the structure of MFE using the stacking energy parameters of Mfold 3.0 [15]. Then they provided the IPknot software [10] based on an MEA model using base pair probabilities computed with different models like the McCaskill [16] or the Dirks and Pierce [8] models. This team also used integer programming to predict RNARNA interactions [17]. Note that integer programming has also been employed in related domains such as multiple RNA sequencestructure alignment [18] or 3D RNA structure by inserting local 3D motifs in RNA secondary structure [19].

We developed an original generic algorithm, that allows to return several optimal and several suboptimal solutions for any BOIP.

We combined the two thermodynamic models Mod1 and Mod2 for prediction of RNA secondary structure with pseudoknots into one BOIP.

We implemented this BOIP with our generic algorithm to predict several optimal and several suboptimal RNA secondary structures. The tool is called BiokoP (Biobjective programming pseudoknot Prediction) and is available on our EvryRNA platform.
We evaluated our algorithm on a dataset of 198 pseudoknotted RNA sequences from PseudoBase++ [20]. The first observation is that the real structure is often given by a suboptimal solution, which confirms the need of returning suboptimal solutions. BiokoP was then compared with other tools proposing several solutions for pseudoknotted RNA secondary structure prediction. To our knowledge, only two tools are available in the literature, namely pKiss [4] and McGenus [5]. BiokoP was also compared to IPknot [10], in the case where one solution is returned. Considering the dataset of pseudoknotted secondary structures, BiokoP gives better F_{1}scores than the other tools. The results in function of the type of pseudoknots show that BiokoP gives homogeneous results regardless of the pseudoknot type. Indeed, the F_{1}scores are always higher than 70% for any number of solutions returned, contrary to those of pKiss and McGenus. The results also show that BiokoP is more likely to return the best structure (according to the F_{1}score) among the optimal solutions than the other tools. We also experimented BiokoP on a dataset of pseudoknotfree RNA sequences from RNA STRAND [21]. We compared BiokoP on this dataset with the other tools and with RNAsubopt [22]. RNAsubopt is able to predict pseudoknotfree structures and suboptimal solutions. The results show that BiokoP is able to predict pseudoknotfree secondary structures with F_{1}scores close to those of RNAsubopt and better than those of pKiss and McGenus.
The paper is organized as follows: in the “Methods” section, we start by giving some fundamental definitions in multiobjective optimization. We present our algorithm, which aims to compute several solutions (optimal and suboptimal), for any BOIP. Then, we present how we combined the two models Mod1 and Mod2 into a single BOIP to predict RNA secondary structures with pseudoknots. The “Results” section is devoted to the experimental evaluation of our method. Finally, we discuss about our results in the “Discussion” section and we conclude and give some perspectives in the “Conclusion” section.
Methods
Our work is based on integer programming which consists in optimizing an objective function according to linear constraints over a set of integer decision variables [23]. It allows to model very different problems. Integer programming is usually used to obtain an optimal solution, but here, the purpose is to obtain also several suboptimal solutions.
We are interested in optimizing several objective functions, corresponding here to different models for RNA secondary structure prediction. We thus have a biobjective integer program, and the set of optimal solutions is called the Pareto set. As said before, regarding our biological context, we are interested in finding optimal and suboptimal solutions. In a multicriteria setting, it means to compute suboptimal Pareto sets, namely the kbest Pareto sets for k≥1. Hence, we present a new method to generate those sets for a generic biobjective integer program (BOIP). We would like to stress out that this is a totally new problem to our knowledge, this should not be confused with the traditional problem of finding approximate Pareto sets. Indeed, in the latter approach, one wants to find an approximation of the exact Pareto set, whereas in our method we find the exact (sub)optimal Pareto sets.
The biobjective integer programming
Many methods exist to solve multiobjective combinatorial optimization problems and BOIP. There are methods for finding the exact Pareto front [24–28] or an approximation of it [29, 30]. A first difference of our approach with the majority of the above works is that we are rather interested in finding the Pareto set instead of the Pareto front, and in case there are several solutions with the same values for each objective function, we want to find them all. Another more fundamental difference is that we are also interested in computing suboptimal Pareto sets, namely the kbest Pareto sets with k≥1. For example, the second best Pareto set corresponds to the best tradeoff when the solutions belonging to the first Pareto set have been removed. In other words, when the first Pareto set is removed, the remaining nondominated solutions form the 2best Pareto set. Figure 1b shows several kbest Pareto sets.
Algorithm for finding the kbest Pareto sets
In this section, we present an original generic algorithm we developed to compute the kbest Pareto sets for any BOIP:
min f_{1}(x)
min f_{2}(x)
The constraints are described here as linear functions g_{ k } of x.
For the clarity of the presentation let us assume first that all the variables in the BOIP are binary ones. In that case, given a set F of forbidden solutions, we denote by P_{1}(λ_{ min },λ_{ max },F) the following IP:
min f_{1}(x)
In this IP the first objective function f_{1} to be minimized stays the same. The second objective function f_{2} is introduced by two constraints which will maintain its value between λ_{ min } and λ_{ max }.
For each solution s in F, a constraint DIFF (s), also present in [31], is added. This constraint forbids to find the solution in F again. The constraint is defined in the following way. Let assume we have found a solution \(x^{*}=\left (x^{*}_{1},x^{*}_{2},\dots,x^{*}_{n}\right) \in F\) of a binary IP. Let define \(B:=\left \{ i  x^{*}_{i}=1\right \}\) and \(N :=\left \{ i  x^{*}_{i}=0\right \}\). The DIFF (x^{∗}) constraint is: \(\sum _{i\in B}(1x_{i}) + \sum _{i\in N}{x_{i}} \geq 1\). This constraint ensures that the (Hamming) distance between any feasible solution s and the solution x^{∗} is at least one. Therefore, there must be at least one variable x_{ i } which takes a different value from \(x^{*}_{i}\).
Of course, these modifications do not change the main algorithm, their aim is to forbid the solutions in F. In the following, we denote again by P_{1}(λ_{ min },λ_{ max },F) the resulting mixed linear program.
We denote by P_{2} the following IP:
max f_{2}(x)
The general idea of our algorithm is to recursively perform a dichotomic search in the areas above and below each new solution found. We denote by nb the number of Pareto sets seeked. At the end of the algorithm, the set \(\mathcal R\) will contain all the solutions belonging to the kth Pareto sets, for 1≤k≤nb. For each solution s found during the execution of the algorithm, we have a label, denoted by l(s), indicating the index of the set this solution belongs to, i.e., l(s)=k iff the solution s belongs to the kth Pareto set.
Our algorithm, called FindKParetoSets works as follows. First, we find a (leftmost) solution L, minimizing the f_{1} criterion. We set its label to 1, l(L):=1, and this solution is added to the set \(\mathcal R\). Notice that since there can exist several solutions minimizing f_{1} with different f_{2} values, this solution does not necessarily belong to the first Pareto set. In that case, its correct label will be set during the remaining execution of the algorithm. Then, we compute the solution U maximizing the f_{2} criterion. An f_{1} value of a solution s is noted as s_{1}, and in the same manner, s_{2} defines the f_{2} value. In the following, U_{2} will serve as an upper bound for the recursive search. Finally the localPareto() procedure is called and performs the recursive search, first below L, between −∞ and L_{2}−ε according to the f_{2} criterion, and then above L, between L_{2} and U_{2}. Here ε is a very small constant such that for any pair of solutions s,s^{′} one has either f_{2}(s)=f_{2}(s^{′}) or f_{2}(s)−f_{2}(s^{′})>ε.
The localPareto() procedure is described below. Each search, corresponding to the computation of a portion of a Pareto set, is done between two values, denoted by λ_{ min } and λ_{ max }, that are taken as two arguments. The set F represents a set of solutions previously found between λ_{ min } and λ_{ max }, that we could find again by solving P_{1}. To avoid it, the solutions of F are forbidden as explained before. If the IP P_{1}(λ_{ min },λ_{ max },F) has a solution s (lines ????), by default its label is set to 1 (line ??). Then, the label of s must be computed according to lines ????. If the label is inferior or equal to nb+1, the solution s is added to \({\mathcal R}\). If necessary, the labels of some previously found solutions of \(\mathcal R\) are updated (lines ?? to ??). Finally, the localPareto() procedure is called to search below s (between λ_{ min } and s_{2}−ε) and above s (between s_{2} and λ_{ max }) if the label is inferior to nb.
The first step of our algorithm is to find the solution denoted L, by solving the BOIP (line ??), and add it to the set \({\mathcal R}\) (line ??). Then a maximum threshold U_{2} is found by solving P_{2} (line ??) to search above the first solution L. A search below the solution L is done (line ??) and the solution s^{1} is found. In the localPareto() procedure, the solution s^{1} obtains the label of the previous solution L. A search below s^{1} is done, but no solution is found. The search above s^{1} is done and s^{2} is found. The recursive search continues until no additional solution is found.
Biobjective integer programming for predicting RNA secondary structures with pseudoknots
In this paper, we propose a method for predicting RNA secondary structures with pseudoknots using the algorithm presented based on a BOIP. Our method allows to return several optimal and several suboptimal solutions, optimizing two objectives related to an MEA model and an MFE model. The MEA model, to which we will refer as Mod1, is based on the model proposed in [10] and uses the Dirks and Pierce set of thermodynamic parameters [8]. The MFE model, to which we will refer as Mod2, is based on the model proposed in [6]. Mod1 and Mod2 can describe all kinds of pseudoknots. In the following, we present first how an RNA structure with pseudoknots can be modeled. Then we describe how we combine Mod1 and Mod2 into one BOIP.
Modeling RNA secondary structures with pseudoknots
In Mod1 and Mod2, the RNA secondary structures are modeled in the following way. An RNA sequence s is composed of n nucleotides or bases which can be A, U, G or C. Each base can be paired according to the WatsonCrick (AU and GC) or the Wobble (GU) pairings. To take into account the pseudoknots, it is assumed that a secondary structure can be decomposed into m pseudoknotfree substructures y^{1}, y^{2},…, y^{ m }, called levels. The levels are disjoint sets meaning that a base pair belongs to exactly one level. From experimental data, it is generally assumed that two levels are sufficient to describe most known RNA structures. Then, in the following, m=2.
A base pair between the bases i and j in level p is represented by a binary variable \(y_{ij}^{p}\) equal to 1, with i=1,…,n and j=i+1,…,n. If there is no base pair between i and j, \(y_{ij}^{p}\) is equal to zero.
The possible types of base pairs correspond to integer values 1,…,6: AU has the value 1, CG the value 2, GC the value 3, GU the value 4, UG the value 5 and UA the value 6.
The possible stacks of two base pairs (i,j) and (i−1,j+1) in level p are defined with binary variables \(x_{ij}^{klp}\), with k and l representing the possible types of base pairs. If \(x_{ij}^{klp}\) is equal to 1, then the bases i and j, and the bases i+1 and j−1 are paired, and in the case where \(x_{ij}^{klp}\) is equal to zero, there is either one base pair or no base pair at all.
Predicting RNA secondary structures with pseudoknots by combining two models
In the BOIP, we combine Mod1 and Mod2. The objective of Mod1 is to find the MEA structure with none pseudoknot or with one or several pseudoknots of any type.
where β^{ p } are constants for each level p, fixed to β^{ p }=1/m, p_{ ij } are the base pair probabilities computed with the Dirks and Pierce model and θ^{ p } is a threshold aiming to ignore the lower base pair probabilities.
with e_{ kl } the energy given in [6], depending on the types k and l of the two base pairs.
For the need of the algorithm, the sign of the function f_{1}(y) is changed to have two objective functions to minimize.
with \(B^{p}=\left \{ ij  y_{ij}^{*p}=1\right \}\) and \(N^{p} =\left \{ ij  y_{ij}^{*p}=0\right \}\).
This constraint corresponds to the previous constraint but the levels of the sets B and N are inverted. Then, the base pairs of the level 1 are forbidden in level 2 and vice versa.
Results
The BOIP presented for predicting RNA secondary structures with pseudoknots is implemented using the CPLEX Optimizer V12.6.3 solver [33]. Our algorithm is implemented with ε=0.001 and m=2. The obtained tool, called BiokoP, is available on our EvryRNA platform.
In the following, we first present the datasets we use for the evaluation of BiokoP, then the experiments showing the distribution of real structures found over the generated solutions. The next section is devoted to a statistical analysis of structures predicted by BiokoP and by other tools from the literature. We end by giving some information on the execution time of BiokoP.
Datasets
We evaluate our approach on a dataset of pseudoknotted RNAs we built from the PseudoBase ++ database [20]. This dataset gathers 198 sequences whose lengths range from 21 to 128 nucleotides.
We also built a second dataset of pseudoknotfree RNAs from the RNA STRAND database [21]. It gathers 145 nonredundant sequences whose lengths range from 10 to 97 nucleotides.
These datasets are available on the EvryRNA platform.
Distribution of real structures over the returned solutions
In this section we study the ability of BiokoP to find the real structures. The purpose is to analyze where the real structures are found, over the Pareto sets or in function of the number of solutions returned. This section is also devoted to a comparison between BiokoP, Mod1 and Mod2 in order to determine the contribution of BiokoP.
Distribution of real structures over the Pareto sets
We study the distribution of real structures returned by BiokoP on our dataset of pseudoknotted RNAs over the Pareto sets. The real structure is the structure that corresponds exactly to the referenced structure for a given RNA.
Distribution of real structures found by BiokoP in function of Pareto sets
kbest Pareto set, k=  1  2  3  4  5  6  7  8  9  10  Total 

Number of real structures  45  15  13  7  1  0  0  0  1  1  83 
Distribution of real structures in function of the number of solutions returned
BiokoP is made to return sets of solutions and all the solutions belonging to one Pareto set are not comparable. Then, this experiment requires to rank the solutions of the Pareto sets returned by BiokoP in order to compare the solutions one against the others. The solutions of each Pareto set are ranked in the following manner: the solutions optimizing equally the two objectives, i.e., the solutions closer to the diagonal, are better ranked.
The results on the dataset of pseudoknotted RNA show that, as expected, BiokoP predicts more real structures than Mod1 and Mod2 (corresponding respectively to Mod1^{so} and Mod2^{so} for one solution returned). Indeed, BiokoP, Mod1 and Mod2 return the real structure for respectively 32, 25 and 23 RNAs. We observe that, in these sets of real structures returned by Mod1 and Mod2, 12 RNAs are identical. Those RNAs also show up in the set of real structures returned by BiokoP. In the remaining real structures found by BiokoP, 6 are neither found by Mod1 nor by Mod2. This shows clearly the pertinence of combining Mod1 and Mod2. Besides, we note that BiokoP finds all the real structures found by Mod1. Some real structures found by Mod2 are not found by BiokoP when one solution is returned but they are all found by BiokoP in the first Pareto set. The real structures found by Mod1 and Mod2 are all returned by BiokoP as optimal solutions, showing that our algorithm succeeds to take benefit from both models.
The more there are solutions, the more BiokoP is likely to find the real structure, and with a fast increase in probability. We observe that after about 20 solutions returned (for about 2 or 3 Pareto sets), the number of real structures found seems to be stable, which supports the results of the previous section. In case of Mod1^{so} and Mod2^{so}, the number of real structures found quickly reaches a plateau amounting to 7 8 solutions returned. This is due to the lack of diversity of the suboptimal solutions. Indeed, the suboptimal solutions are essentially similar to the optimal one: they are derived from the optimal solution by removing only very few base pairs. When the optimal solution is close to the real structure, the real structure can be found quickly as a suboptimal solution, explaining the increase of the curve for a small number of returned solutions.
Finally, this experiment shows that the optimal and suboptimal solutions returned by BiokoP are more likely to contain the real structure compared to those of Mod1^{so} and Mod2^{so}.
Comparison of BiokoP with the literature
Considered software
To evaluate the performances of BiokoP, we compare it with other methods predicting pseudoknotted RNA secondary structures that are able to return several solutions. To our knowledge, only two methods are available in the literature, namely pKiss [4] and McGenus [5]. The principle of pKiss is to decompose the RNA sequence into every possible subwords and to compute the MFE secondary structure of the decompositions. To reduce the search space, pKiss is based on the canonical rules which reduce the number of possible predicted pseudoknots (only certain canonical and kissing pseudoknots) and the redundancy thanks to a nonambiguous dynamic programming algorithm. McGenus is based on a Monte Carlo algorithm which search for a minimum score which includes the energy and the genus of the secondary structure. The genus expresses the complexity of a pseudoknot. McGenus performs a stochastic search that allows to find various types of pseudoknots.
We also compare BiokoP with IPknot [10] and RNAsubopt from the ViennaRNA package [22]. RNAsubopt predicts pseudoknotfree RNA secondary structures using an MFE algorithm to compute all the suboptimal structures in an energy range.
For the evaluation, we consider the first solution returned by IPknot and the 30 first solutions returned by BiokoP, pKiss, McGenus and RNAsubopt. IPknot (version 0.0.4) was executed with the Dirks and Pierce set of thermodynamic parameters and with the options g 2 and g 4. pKiss (version 2.2.12) was executed with the default parameters. We used the option relativeDeviation to obtain up to 30 solutions for each RNA. McGenus (version 7.0) was also executed with the default parameters, with the option nsuboptimal to obtain 30 solutions. We executed RNAsubopt (version 2.3.3) with the option e to obtain 30 solutions and with the option s to sort the solutions by energy.
For pKiss, McGenus and RNAsubopt, the solutions are ranked in the returned order, i.e., in the ascending order of energies. For BiokoP, as the solutions belonging to the same Pareto set are returned in an arbitrary order and are not comparable, we adopt the same ranking as in the previous section. We consider that the best solutions are the ones that optimize equally the two objectives, and are therefore closer to the diagonal.
Statistics used
Overall results
Sensitivity and PPV results for BiokoP, pKiss and McGenus on pseudoknotted RNAs
NbSol  BiokoP  pKiss  McGenus  BiokoP  pKiss  McGenus  

Sensitivity s.d.  PPV s.d.  
1  80.6  22.3  79.5  24.2  73.4  26.6  75.0  25.5  75.1  26.6  74.1  28.6 
2  80.6  22.3  77.6  23.5  67.1  26.7  73.2  25.5  74.8  26.2  69.7  31.6 
3  80.1  22.4  75.8  23.6  63.8  30.5  71.6  25.7  74.2  26.4  67.3  32.7 
4  79.5  22.8  74.4  23.8  62.2  30.8  70.5  26.0  73.5  26.6  66.2  33.3 
5  79.0  23.1  73.1  24.0  61.0  31.0  69.7  26.2  72.9  26.8  65.5  33.6 
10  77.0  23.6  68.8  24.1  57.4  31.9  67.4  26.4  70.9  27.1  62.6  35.3 
15  75.8  23.6  65.7  24.5  55.7  32.6  66.4  26.4  69.3  27.6  61.0  36.1 
20  75.1  23.6  63.2  25.0  54.2  33.1  65.9  26.4  67.9  28.1  59.4  36.8 
25  74.5  23.5  61.2  25.3  53.3  33.4  65.5  26.3  66.7  28.5  58.4  37.2 
30  73.8  23.5  59.5  25.5  53.3  33.4  65.0  26.3  65.2  28.9  58.5  37.2 
For one solution returned, BiokoP gives similar results to IPknot (IPknot gives a mean sensitivity of 80.6%, a mean PPV of 75.1% and a mean F_{1}score of 77,0%).
Results over optimal solutions
Finally, we observe that the gap between the minimum and the maximum F_{1}scores of BiokoP can be important. This shows that BiokoP returns a diversified set of optimal solutions.
Results by pseudoknot types
The BOIP of BiokoP has been modeled to be able to predict any kind of pseudoknots. This is confirmed by the results obtained that are very homogeneous. Indeed, the F_{1}scores of BiokoP are never lower than 70% for any number of solutions returned. This is not the case for pKiss and McGenus, for which we can observe that the results depend greatly on the pseudoknot type. In particular, they obtain F_{1}scores around 50% for the HLin type. Since the datasets of some pseudoknot types are small (3 HHH, 26 HLout, 4 HLin, 11 LL and 154 H), further experiments need to be done to confirm the results.
Finally, when one wants to predict a secondary structure of an RNA, there is generally no information about the pseudoknot type. Therefore, it is better to use BiokoP, which allows to predict structures as close as possible to the real ones, regardless of the pseudoknot type.
Results on pseudoknotfree RNAs
In order to determine if BiokoP is able to predict pseudoknotfree structures of RNAs, we evaluated it on our dataset of pseudoknotfree RNAs and compared the obtained results with those obtained by pKiss, McGenus, IPknot and also RNAsubopt which is a tool dedicated to pseudoknotfree RNA prediction.
Illustration on some examples of RNAs
We also study the results of BiokoP on a pseudoknotfree hammerhead ribozyme [37] (Fig. 10b). This particular structure, composed of three helices, is much studied for the understanding of the structurefunction relation of RNAs. The best solution returned by BiokoP has a F_{1}score of 81.3% and is returned at rank 9, in the second Pareto set. BiokoP finds the two first helices and some base pairs of the third one, but predicts an extra pseudoknot of two base pairs. pKiss and McGenus find the two first helices of the RNA. However, pKiss predicts an hairpin that does not exist in the referenced structure and McGenus does not find any base pair of the third helix. RNAsubopt finds the two first helices and some base pairs of the third one. The structure predicted by IPknot is identical to the one predicted by BiokoP, but its pseudoknot possesses an additional base pair. This example supports the previous results on the pseudoknotfree RNA dataset. As RNAsubopt predicts structures without pseudoknots, its results are better than the ones obtained by the other tools which can predict extra pseudoknots. However, the best solution of RNAsubopt is predicted at rank 25 which is far compared to the one of BiokoP.
Execution time
The complexity of solving optimally integer programs is exponential with respect to the number of variables, which depends mainly on the length of the sequence. Moreover, we deal here with multiobjective programming which increases the number of BOIPs to solve and their number of variables. As estimation of the time of BiokoP, the prediction of 30 secondary structures for an RNA of 30 nucleotides on Debian OS with 24 Intel Xeon CPU X5660 (2.8 GHz) and 64Go RAM took, in average, 22 s. For the pseudoknot region PK4 [35] of the Legionella Pneumophila tmRNA (PKB70) of 55 nucleotides, BiokoP took 5min30.
Discussion
We developed an original generic algorithm to solve BOIPs and to return several optimal solutions (the exact Pareto set) and several suboptimal solutions (the suboptimal Pareto sets). We proposed a BOIP that combines two models of RNA secondary structure prediction with pseudoknots, namely the MEA and the MFE models. The implementation of our algorithm with the proposed BOIP has led to a tool called BiokoP (Biobjective programming pseudoknot Prediction).
Since our method is based on multiobjective optimization, BiokoP returns several sets of solutions (the kbest Pareto sets). Those Pareto sets are returned in an optimal order. However, all the solutions belonging to a Pareto set are not comparable and hence cannot be meaningfully ordered within the set, whereas the tools of the literature return solutions which are ordered optimally. Hence, comparing the results of BiokoP and of the tools of the literature raises some difficulties and, in multiobjective optimization, defining pertinent performance metrics for the Pareto sets is a subject of research [38]. We have chosen to rank the solutions within each Pareto sets assuming that a solution is better if it optimizes equally the two objectives, i.e., if the solution is close to the diagonal. This ranking allows us to show the benefits of combining Mod1 and Mod2. Indeed, the experiments show that for the first ranked solution returned, BiokoP is more likely to return the real structure than Mod1 and Mod2 (Fig. 5). The experiments also show that this combination allows to obtain homogeneous F_{1}scores compared to the other tools regardless of the number of solutions returned, the type of pseudoknots and the presence or not of pseudoknots (Figs. 6, 8, and 9).
However, this ranking shows some limitations. It is illustrated by the results detailed by pseudoknot types, where we observe that the solution giving the best F_{1}score is not always the first one (Fig. 8). To have a better idea of the global quality of the overall first Pareto set, we decided to compare it versus the optimal solution of the other tools, by studying the best and the worst structures (according to the F_{1}score) of the first Pareto set. The results show that in most cases, BiokoP finds more often better solutions than the other tools (Fig. 7).

that the combination of Mod1 and Mod2 is relevant: BiokoP returns all the real structures found by Mod1 and Mod2 as optimal solutions and is able to find more real structures (in optimal or in suboptimal Pareto sets) than the two models alone;

and that BiokoP, compared to the literature, achieves the best compromise between the number of times the best solution (with the highest F_{1}score) is found and the global quality of predictions.
Conclusion
In this paper, we provide an original approach for predicting RNA secondary structures with pseudoknots that is based on the computation of several optimal and several suboptimal solutions with respect to two different models. Our method is based on integer programming approach, that presents the advantage to be more flexible compared to the dynamic programming approach usually used in the RNA secondary structure prediction. In order to combine the two models, we developed a biobjective integer program (BOIP), and proposed a generic novel algorithm to compute the sets of optimal and suboptimal solutions for any BOIP. We applied it to the prediction of RNA secondary structures with pseudoknots, obtaining the software BiokoP. The experimental tests performed have confirmed the importance of considering suboptimal solutions in addition to optimal ones. We have shown that our method successfully combines Mod1 and Mod2, taking benefit from the two models.
BiokoP was compared with other tools from the literature for RNA secondary structure prediction with pseudoknots that propose several solutions, namely pKiss [4] and McGenus [5]. BiokoP was also compared to IPknot [10]. IPknot predicts pseudoknotted structure, but no suboptimal solutions. Finally, BiokoP was compared to RNAsubopt [22] which can predict only pseudoknotfree structures, but with suboptimal solutions. Considering a set of 198 pseudoknotted secondary structures gathered from Pseudobase++ [20], BiokoP gives better F_{1}scores than pKiss and McGenus. We also show that BiokoP returns more time the best solution (with the higher F_{1}score) than pKiss, McGenus and IPknot, considering the optimal solutions. Even if BiokoP is made to predict pseudoknotted RNAs, the F_{1}score results for the dataset of pseudoknotfree RNAs are close to the results of RNAsubopt. The results of BiokoP are widely homogeneous regardless of the pseudoknot type or the presence or not of pseudoknots. Indeed, the F_{1}scores are always higher than 70% and regardless of the number of solutions returned, unlike pKiss and McGenus.
A drawback of our approach is the time complexity due to the need to optimally solve several integer programs. As stated before, this complexity is exponential, with respect to the length of the sequence. A perspective to decrease the execution time is to use parallelism. Indeed, due to its recursive shape, the generic algorithm we propose to solve BOIP is suitable to parallelization. Another idea to decrease the complexity of the BOIP would be to add constraints when some information on the structure are known: base pairs, presence or not of pseudoknots, type of pseudoknots.
In this work, we combined two models inspired from the literature. A perspective is to combine better monocriterion models in order to raise the global quality of the solutions found. We would first try to model the comparative approach with integer programming to propose a combination with one of the thermodynamic model. This would take benefit from the additional information brought by homologous sequences. Finally, another future work will consist in developing multiobjective algorithms in order to combine more than two models. Moving from two objectives to three will require to rethink the current algorithm in three dimensions. This would represent the first step to develop a multiobjective generic algorithm able to return suboptimal Pareto sets for any IP with n objective functions.
Declarations
Acknowledgements
Fariza Tahi would like to thank the Genopole for the financial support given for the platform EvryRNA.
Funding
The platform EvryRNA is partially supported by the Genopole.
Availability of data and materials
The developed software, BiokoP, and the datasets are available on our EvryRNA platform (https://EvryRNA.ibisc.univevry.fr).
Authors’ contributions
AL developed the algorithm and tested it. AL recovered the data and analyzed the results. EA contributed to the integer programming part of this work. FT was involved in the RNA secondary structure part and supervised the work. The three authors participated in the writing of the manuscript and approved the final version of the manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Namy O, Moran SJ, Stuart DI, Gilbert RJ, Brierley I. A mechanical explanation of RNA pseudoknot function in programmed ribosomal frameshifting. Nature. 2006; 441(7090):244–7.View ArticlePubMedGoogle Scholar
 Sheikh S, Backofen R, Ponty Y. Impact of the energy model on the complexity of RNA folding with pseudoknots. In: Annual Symposium on Combinatorial Pattern Matching. Berlin: Springer: 2012. p. 321–33.Google Scholar
 Lyngsø RB, Pedersen CN. RNA pseudoknot prediction in energybased models. J Comput Biol. 2000; 7(34):409–27.View ArticlePubMedGoogle Scholar
 Janssen S, Giegerich R. The RNA shapes studio. Bioinformatics. 2014; 31(3):423–425. Oxford Univ Press.View ArticlePubMedPubMed CentralGoogle Scholar
 Bon M, Micheletti C, Orland H. McGenus: a Monte Carlo algorithm to predict RNA secondary structures with pseudoknots. Nucleic Acids Res. 2012; 41(3):1895–1900. Oxford Univ Press.View ArticlePubMedPubMed CentralGoogle Scholar
 Poolsap U, Kato Y, Akutsu T. Prediction of RNA secondary structure with pseudoknots using integer programming. BMC Bioinformatics. 2009; 10(Suppl 1):38.View ArticleGoogle Scholar
 Reeder J, Giegerich R. Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics. 2004; 5(1):104.View ArticlePubMedPubMed CentralGoogle Scholar
 Dirks RM, Pierce NA. A partition function algorithm for nucleic acid secondary structure including pseudoknots. J Comput Chem. 2003; 24(13):1664–77.View ArticlePubMedGoogle Scholar
 Rivas E, Eddy SR. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol. 1999; 285(5):2053–68.View ArticlePubMedGoogle Scholar
 Sato K, Kato Y, Hamada M, Akutsu T, Asai K. IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics. 2011; 27(13):85–93.View ArticleGoogle Scholar
 Bellaousov S, Mathews DH. ProbKnot: fast prediction of RNA secondary structure including pseudoknots. RNA. 2010; 16(10):1870–80.View ArticlePubMedPubMed CentralGoogle Scholar
 Engelen S, Tahi F. Tfold: efficient in silico prediction of noncoding RNA secondary structures. Nucleic Acids Res. 2010; 38(7):2453–66.View ArticlePubMedPubMed CentralGoogle Scholar
 Ruan J, Stormo GD, Zhang W. An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics. 2004; 20(1):58–66.View ArticlePubMedGoogle Scholar
 Saule C, Giegerich R. Pareto optimization in algebraic dynamic programming. Algoritm Mol Biol. 2015; 10(1):10–22.View ArticleGoogle Scholar
 Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003; 31(13):3406–15.View ArticlePubMedPubMed CentralGoogle Scholar
 McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990; 29(67):1105–19.View ArticlePubMedGoogle Scholar
 Kato Y, Sato K, Hamada M, Watanabe Y, Asai K, Akutsu T. RactIP: fast and accurate prediction of RNARNA interaction using integer programming. Bioinformatics. 2010; 26(18):460–6.View ArticleGoogle Scholar
 Bauer M, Klau GW, Reinert K. An exact mathematical programming approach to multiple RNA sequencestructure alignment. Algorithmic Oper Res. 2008; 3(2):130–146.Google Scholar
 Reinharz V, Major F, Waldispühl J. Towards 3D structure prediction of large RNA molecules: an integer programming framework to insert local 3D motifs in RNA secondary structure. Bioinformatics. 2012; 28(12):207–14.View ArticleGoogle Scholar
 Taufer M, Licon A, Araiza R, Mireles D, Van Batenburg F, Gultyaev AP, Leung MY. PseudoBase++: an extension of PseudoBase for easy searching, formatting and visualization of pseudoknots. Nucleic Acids Res. 2009; 37(suppl 1):127–35.View ArticleGoogle Scholar
 Andronescu M, Bereg V, Hoos HH, Condon A. RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinforma. 2008; 9(1):340.View ArticleGoogle Scholar
 Lorenz R, Bernhart SH, Zu Siederdissen CH, Tafer H, Flamm C, Stadler PF, Hofacker IL. Viennarna package 2.0. Algoritm Mol Biol. 2011; 6(1):26.View ArticleGoogle Scholar
 Williams HP. Model building in mathematical programming. West Sussex: Wiley; 1999.Google Scholar
 Özlen M, Burton BA, MacRae CA. Multiobjective integer programming: An improved recursive algorithm. J Optim Theory Appl. 2014; 160(2):470–82.View ArticleGoogle Scholar
 Özlen M, Azizoğlu M. Multiobjective integer programming: a general approach for generating all nondominated solutions. Eur J Oper Res. 2009; 199(1):25–35.View ArticleGoogle Scholar
 Bérubé JF, Gendreau M, Potvin JY. An exact εconstraint method for biobjective combinatorial optimization problems: Application to the Traveling Salesman Problem with Profits. Eur J Oper Res. 2009; 194(1):39–50.View ArticleGoogle Scholar
 Ralphs TK, Saltzman MJ, Wiecek MM. An improved algorithm for solving biobjective integer programs. Ann Oper Res. 2006; 147:43–70.View ArticleGoogle Scholar
 Ulungu EL, Teghem J. The twophases method: An efficient procedure to solve biobjective combinatorial optimization problems. Fondations Comput Decis Sci. 1995; 20:149–65.Google Scholar
 Bazgan C, Jamain F, Vanderpooten D. Approximate pareto sets of minimal size for multiobjective optimization problems. Oper Res Lett. 2015; 43(1):1–6.View ArticleGoogle Scholar
 Vassilvitskii S, Yannakakis M. Efficiently computing succint tradeoff curves. Theor Comput Sci. 2005; 348(2–3):334–56.View ArticleGoogle Scholar
 Balas E, Jeroslow R. Canonical cuts on the unit hypercube. SIAM J Appl Math. 1972; 23(1):61–9.View ArticleGoogle Scholar
 Tsai JF, Lin MH, Hu YC. Finding multiple solutions to general integer linear programs. Eur J Oper Res. 2008; 184(2):802–9.View ArticleGoogle Scholar
 IBM. CPLEX Optimizer V12.6.3. http://www03.ibm.com/software/products/fr/ibmilogcpleoptistud. Accessed 09 Jan 2018.
 Huang X, Ali H. High sensitivity rna pseudoknot prediction. Nucleic Acids Res. 2006; 35(2):656–63.View ArticlePubMedPubMed CentralGoogle Scholar
 Williams KP, Bartel DP. Phylogenetic analysis of tmRNA secondary structure. Rna. 1996; 2(12):1306–10.PubMedPubMed CentralGoogle Scholar
 Kerpedjiev P, Hammer S, Hofacker IL. Forna (forcedirected RNA): Simple and effective online RNA secondary structure diagrams. Bioinformatics. 2015; 31(20):3377–3379. Oxford Univ Press.View ArticlePubMedPubMed CentralGoogle Scholar
 GriffithsJones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A. Rfam: annotating noncoding RNAs in complete genomes. Nucleic Acids Res. 2005; 33(suppl_1):121–4.Google Scholar
 Cheng S, Shi Y, Qin Q. On the performance metrics of multiobjective optimization. Adv Swarm Intell. 2012;504.Google Scholar