As stated before, the RNA complex prediction problem can be viewed as a graph problem in which we must find a constrained clique. In this section, we first describe the relationship between the RNA complex prediction and the Maximum Weight Clique Problem (MWCP). Then, we propose an heuristic to find good solutions in a polynomial time for the constrained MWCP.
Predicting RNA complexes: a constrained MWCP
RNA complexes
An RNA complex is composed of a set of structured RNAs that interact with each other. It can therefore be considered as a set of RNA secondary structures and of RNA-RNA interactions. A secondary structure involves exactly one RNA strand and is composed of a list of base pairs internal to this RNA strand (Fig. 1a). A secondary structure can contain pseudoknots (Fig. 1b). An interaction site (Fig. 1c) involves two RNA strands, is composed of a list of base pairs, and does not contain crossing interactions (Fig. 1d).
Constrained MWCP
The RNA complex prediction problem can be formalized using a weighted graph G(V,E) such as:
-
V, the vertex set, is composed of two subsets, VS and VI, where VS is the set of vertices representing the secondary structures and VI is the set of vertices representing the interactions. Each vertex v∈V has a weight equals to the free energy associated to the structure or the interaction.
-
E, the edge set, represents the compatibilities between the vertices. An edge exists if and only if two vertices are compatible. We consider that two vertices are not compatible if at least two identical nucleotides are involved in different pairings or if the two vertices are secondary structures involving the same RNA strand. These compatibility rules allow the presence of any motif in RNA complexes: pseudoknots (that can already be present in the secondary structures, which are represented by the vertices VS) or crossing interactions.
An RNA complex can be viewed as a complete graph, or a clique, where each vertex is linked to all the other vertices. This clique is constrained because, for each RNA, there must be exactly one secondary structure vertex. This brings another constraint. In some known complexes, the RNAs do not have internal base pairs, which implies to add for each RNA a vertex corresponding to a secondary structure with no base pairs. However, an RNA complex only composed of secondary structures or interactions with no base pairs at all is not an RNA complex. We call the cliques corresponding to this type of RNA complexes weak cliques.
The weight of a clique (constrained or not) is the sum of the weights of the vertices composing the clique. We have therefore a constrained maximum vertex weight clique problem, denoted in the sequel by constrained MWCP, where the clique i) is composed of exactly one secondary structure per RNA, ii) is not weak and iii) has a minimum free energy.
Free energy computation
To each secondary structure and each interaction is associated a free energy represented by the vertex weights in the graph. This energy is computed to unify the different sources of secondary structures and interactions. We use two energy models:
-
The first model is the Turner model [33] (with the 2004 parameter release), which can be used for secondary structure without pseudoknots and for interactions.
-
The second model is based on the sum of the stacking energies taken from the Turner model (used in [24]). This allows to compute the free energy of pseudoknotted secondary structures.
Once the free energy of each secondary structure and interaction is known, it is used as the weight of each vertex. Then, the free energy of a complex (a constrained clique) is approximated by the sum of the free energies (weights) of the secondary structures and interactions (the vertices in the clique) composing it.
Solving the constrained MWCP
The MWCP is NP-hard. However, this problem is well studied and various methods exist to solve it. Exact methods, which find the optimal solution by optimizing the weight of the clique, are either generalization of methods for the unweight problem [34, 35] or are branch and bound algorithms [36]. Since exact methods are time consuming due to the NP-hardness nature of the problem, a lot of various heuristics have been proposed, either based on local search [37], tabu search [38], both of them [32] or other techniques [39, 40].
In this paper, we propose an adaptation of the heuristic Breakout Local Search (BLS) published in [32] that provides good solutions in a short amount of time. In the following, we first present BLS and then the heuristic we propose for our constrained MWCP that we denote by BLS-CMWCP.
Breakout Local Search heuristic
The BLS heuristic [32] was proposed for the MWCP and is based on both local search and tabu search.
Local search [41] is an heurisitic method to find good solutions for combinatorial optimization problems. The local search is an iterative method. It starts from an initial solution and modifies it at each step by looking in its neighborhood, i.e. a set of neighboring solutions obtained by performing small modifications (movements) on the current solution, for a better solution. When a solution cannot be improved anymore, it is called a local optimum solution.
Tabu search [42] is a metaheuristic based on local search. The main difference with the local search is that at each iteration the best solution in the neighborhood of the current solution is selected, even if it is not better than the current solution. In order to avoid cycling through previously encountered solutions, a tabu list is used.
BLS starts from a random solution and then performs alternatively two phases until the known optimal solution is found or the time limit is reached:
-
1
Local search: to perform a local search until a local optimum is found.
-
2
Perturbation: to modify greatly the local optimum solution to escape from it and explore further the search space.
In the local search phase of BLS, all the possible movements are considered and the one optimizing the most the solution weight is chosen. The available movements are either to add a vertex in the clique or to replace a vertex in the clique with one that is not in the clique. To define the movements, some definitions are needed. Let G(E,V) be a graph and C be the current clique.
-
PA is the set of vertices that can be directly added in the clique C, because there exist edges between them and all the vertices of the clique C; PA={v:v∉C,∀u∈C ∃[v,u]∈E}.
-
OM is the set of pairs of vertices (v,u) where v is not in the clique C but there exist edges between it and all the vertices of the clique C, except u which is in the clique C; OM={(v,u):v∉C and u∈C,∀v′∈C∖{u}∃[v,v′]∈E and [v,u]∉E}. The OM set is used to do the replacement movements.
-
OC is the set composed of all the vertices outside C; OC={v:V∖C}.
The perturbation phase aims to modify the current solution to escape a local optimum. The perturbation can greatly degrade the solution, the strength depending on how many times the solution was not improved in the local search phase. The perturbation strategies are based on four movements that are performed several times: to add, replace or remove a vertex of the clique (weak perturbation) and to restart (strong perturbation). The perturbation phase uses a tabu list in order to avoid to pick up again a vertex for a movement if it was removed from the solution some iterations before. The perturbation phase is a main difference with other local search methods and allows to explore more efficiently and faster the search space.
The authors show that this heuristic provides improved results for a number of MWCP instances and that this heuristic is usable for large graphs in reasonable time. It makes this heuristic a good candidate to develop a tool for the RNA complex prediction since large sets of inputs can be used.
The BLS-CMWCP algorithm
The BLS method is adapted for the constrained MWCP by making some modifications to the initial clique finding phase and to the movements, in order (i) to take into account the different kinds of vertices, (ii) to take into account the constraints related to the secondary structures and (iii) to avoid the weak cliques.
Before describing our BLS-CMWCP algorithm, let us give the new definitions about the sets used (illustrated in Fig. 2). Let C be a clique from the initial graph G(E,V).
-
Let PA be the set composed of all the interaction vertices that are outside C and are connected to all the vertices in C; PA={vI:vI∉C,∀u∈C ∃[vI,u]∈E}.
-
Let OM be the set composed of the interaction vertices pairs (vI,uI) (or secondary structure vertices pairs (vS,uS)) such that vI (or vS) is outside C and is connected to all vertices in C except to the vertex uI∈C (or uS∈C).
-
Let OC be the set composed of all interaction vertices that are outside C; OC={vI:VI∖C}.
Having only the interaction vertices vI in PA and in OC and having only interaction vertex pairs (vI,uI) or secondary structure vertex pairs (vS,uS) in OM allow only the movements respecting the constraint of having exactly one secondary structure per RNA. In the following, we describe each modification and the differences between our algorithm BLS-CMWCP and BLS:
-
To generate the initial clique: in BLS, the phase consists in selecting randomly a vertex and then to add iteratively vertices if they form a clique, until no more vertex can be added. In BLS-CMWCP, this phase consists in selecting randomly an interaction vertex and then selecting for each RNA a secondary structure vertex that forms a clique. Forming a clique is always possible thanks to the empty secondary structures which are obviously compatible with any interaction vertex.
-
To add a vertex movement: an interaction vertex vI is selected in PA and added into C.
-
To replace a vertex movement: a vertex pair (vI,uI)(or(vS,uS)) is selected in OM, vI (or vS) is added to C and uI (or uS) is removed from C.
-
To remove a vertex movement: an interaction vertex vI is selected in C to be removed.
-
To restart the clique movement: an interaction vertex vI from OC is added to the clique C. Then if the structure vertices of the clique, vS∈C, do not form a clique, they are replaced with other structure vertices. Finally, the remaining interaction vertices of the clique, vI∈C, are removed if they do not form a clique anymore.
-
To generate sub-optimal cliques: contrary to BLS method, we want to return several sub-optimal cliques. In BLS-CMWCP, at each iteration of the local search and of the perturbation phases, any new clique is saved. Then when the solutions are returned, they are sorted according to their free energy.
-
To forbid the weak cliques: during all the search, if a movement leads to a weak clique, it is not considered.
RCPred: implementation of BLS-CMWCP for RNA complex prediction
The BLS heuristic proposed in [32] has a set of parameters to modulate the strength of the perturbation (L0 and LMax), the maximum number of non-improving solutions before a strong perturbation is performed (T), the coefficients for accepting non-improving solutions (αs and αr), the coefficient for tabu tenure (ϕ) and the probability for applying directed perturbations (P0). Some parameters were fixed in the BLS heuristic and we used them as such in BLS-CMWCP. We determined the other parameters by performing experiments on a dataset of 30 graphs derived from RNA complexes. We then chose the following parameters: L0=0.1∗|V|, LMax=0.1∗|V|, T=10, αs=0.5, αr=0.5, ϕ=7 and P0=1. The stop condition for the local search occurs either when the optimum clique is found or the maximum number of iterations is reached. In RNA complex prediction we do not know the optimum clique, then the stop condition here is a maximum number of iterations (fixed at 500). This parameter can be set by the user.
We implemented in C++ BLS-CMWCP and obtained the tool called RCPred (RNA Complex Prediction). RCPred takes as inputs n sequences of RNAs, several secondary structures per RNA and several interactions per pair of RNAs. First the energies of the secondary structures and the interactions are computed. The compatibilities between the secondary structures and the interactions are determined and the graph is built. BLS-CMWCP returns constrained cliques from which are derived RNA complexes. If some sequences are identical, symmetrical complexes can occur. These lasts are identified and removed to avoid redundancies in the results. Finally, the RNA complexes are sorted according to their free energy. RCPred is available on the EvryRNA platform.