In the following, we give a brief overview of antaRNA’s optimization approach. A detailed description including all formalisms is provided in [28]. Subsequently, we introduce the recent extension of antaRNA to the design of sequences for crossing pseudoknot structures.
Overview
Given an RNA secondary structure constraint in extended dot-bracket notation \(\mathbb {C}^{\text {str}}\), a targeted GC-content value \(\mathbb {C}^{\text {gc}}\) and a supplemental sequence constraint \(\mathbb {C}^{\text {seq}}\) using IUPAC nucleotide definitions, antaRNA [28] solves the RNA inverse folding problem.
To this end, the Ant Colony Optimization technique [33, 34], an automatically adapting local search scheme, is applied. It mimics the ants’ adaptive search for food within a given terrain (see Fig. 1 and Algorithm 1). Here, the terrain is a graph encoding of the inverse folding problem, whose weighted edges represent the ants’ pheromone. During a walk, the current pheromone state of the terrain guides the ant in selecting edges, which lead to nucleotide-emitting vertices. Within one walk, an ant thereby assembles a solution sequence. The quality of this sequence is scored by its structure, sequence and GC distances to the respective constraints, and the pheromone state of the terrain graph is updated according to this quality score.
Therefore, after a certain number of consecutive sequence assemblies and terrain adaptations, the features of the assembled sequences converge towards the anticipated constraints of the input [28].
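The loop of sequence assembly, scoring and pheromone update can be illustrated with a deliberately reduced toy sketch. It is not antaRNA's implementation: the per-position pheromone table, the function names (`toy_aco`, `assemble`, `gc_distance`), the quality formula and the evaporation rate are all assumptions made for illustration, and only the GC distance is scored so that the example runs without a structure predictor. The actual terrain graph and scoring are described in [28].

```python
import random

NUCS = "ACGU"

def gc_distance(seq, target_gc):
    """Absolute difference between the GC fraction of seq and the target."""
    gc = sum(seq.count(c) for c in "GC") / len(seq)
    return abs(gc - target_gc)

def assemble(pheromone, rng):
    """One ant walk: pick a nucleotide per position, biased by pheromone."""
    return "".join(rng.choices(NUCS, weights=w)[0] for w in pheromone)

def toy_aco(length, target_gc, n_ants=200, evaporation=0.1, seed=0):
    rng = random.Random(seed)
    # Hypothetical terrain: one pheromone weight per position and nucleotide.
    pheromone = [[1.0] * len(NUCS) for _ in range(length)]
    best_seq, best_d = None, float("inf")
    for _ in range(n_ants):
        seq = assemble(pheromone, rng)
        d = gc_distance(seq, target_gc)
        if d < best_d:
            best_seq, best_d = seq, d
        quality = 1.0 / (1.0 + d)  # assumed quality score, higher is better
        for pos, nuc in enumerate(seq):
            # Evaporate all weights, then reinforce the choice just taken.
            pheromone[pos] = [w * (1.0 - evaporation) for w in pheromone[pos]]
            pheromone[pos][NUCS.index(nuc)] += quality
    return best_seq, best_d

if __name__ == "__main__":
    print(toy_aco(20, 0.5))
```

In a fuller version, the distance `d` would also include the structural and sequence distances, with the structural term requiring one mfe prediction per assembled sequence.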
Pseudoknot structures
A main focus of inverse folding is the probability that the designed sequences fold into a given target structure. To this end, for each assembled sequence the minimum free energy (mfe) structure is predicted. antaRNA’s structural distance measure, \(d_{\text {str}}\), evaluates the compliance of an mfe structure with the structural target. This distance guides the pheromone update of the terrain.
For nested target structures, mfe prediction was done using RNAfold from the ViennaRNA package [11, 35]. In this work, structure constraints have been extended to support crossing, i.e. pseudoknotted, structures. To this end, the structure predictor employed in antaRNA was substituted with the program pKiss [29]. pKiss is capable of predicting two specific subclasses of pseudoknots: hairpin (H-type) and kissing hairpin (K-type) structures. Both types are biologically important and play crucial roles in various key functional domains of RNAs [36], although H-type pseudoknots have been reported more often in the literature and in databases.
Since mfe structure prediction is done for each assembled sequence, its time complexity is of importance. RNAfold finds nested structures with a time complexity of \(\mathcal {O}(n^{3})\) for sequences of length n [37]. pKiss predicts mfe structures with pseudoknots in \(\mathcal {O}(n^{4})\) when heuristics are applied. For exact mfe calculations, pKiss requires \(\mathcal {O}(n^{6})\) time [29]. antaRNA provides the possibility to choose the prediction method applied by pKiss.
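As a rough back-of-the-envelope illustration of these bounds (ignoring constant factors and lower-order terms, which is a simplifying assumption and says nothing about measured run times), consider a sequence of length \(n = 100\):

```latex
\[
n = 100: \qquad n^{3} = 10^{6}, \qquad n^{4} = 10^{8}, \qquad n^{6} = 10^{12}
\]
```

Under this simplification, the heuristic pKiss bound is about \(10^{2}\) times the RNAfold bound, and the exact pKiss bound about \(10^{6}\) times. Because one prediction is performed per assembled sequence, this factor applies to every ant walk.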
antaRNA was extended such that structure parsing and management now handle the increased complexity of pseudoknotted structures. The allowed set of brackets within the dot-bracket structure constraint notation was extended to “()[]{}<>”, as used by pKiss. Furthermore, a set of antaRNA parameters optimized for use with pKiss as the structure predictor has been identified. This is discussed in the following sections.
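How the extended bracket set encodes crossing base pairs can be sketched with one stack per bracket type. This is a minimal illustration, not antaRNA's parser; the function names `parse_pairs` and `has_crossing` are hypothetical, and all characters other than the eight brackets are simply treated as unpaired.

```python
# Closing bracket -> matching opening bracket, for the set "()[]{}<>".
PAIRS = {")": "(", "]": "[", "}": "{", ">": "<"}

def parse_pairs(structure):
    """Return the set of base pairs (i, j), i < j, of a dot-bracket string."""
    stacks = {opener: [] for opener in PAIRS.values()}
    pairs = set()
    for i, ch in enumerate(structure):
        if ch in stacks:
            stacks[ch].append(i)
        elif ch in PAIRS:
            stack = stacks[PAIRS[ch]]
            if not stack:
                raise ValueError(f"unmatched '{ch}' at position {i}")
            pairs.add((stack.pop(), i))
        # Any other character (e.g. '.') is treated as unpaired.
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched '{opener}' at position {stack[-1]}")
    return pairs

def has_crossing(pairs):
    """True if two pairs (i, j) and (k, l) interleave as i < k < j < l."""
    return any(i < k < j < l for (i, j) in pairs for (k, l) in pairs)

if __name__ == "__main__":
    h_type = "((..[[..))..]]"
    print(sorted(parse_pairs(h_type)), has_crossing(parse_pairs(h_type)))
```

Keeping a separate stack per bracket type is what allows a pair written with “[]” to cross a pair written with “()”, which a single shared stack would reject as unbalanced.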
New features
In addition to pseudoknot structure support, antaRNA now provides soft sequence constraints and improved, hard fuzzy structure constraints. Both increase the level of detail at which the target constraints can be defined.
The soft sequence constraint allows the user to specify (in lower case letters) a preference for a nucleotide at a certain position. The preferred nucleotide is not enforced; instead, the sequence quality assessment applies a penalty if a different nucleotide is placed at that position. This adds flexibility to antaRNA-based sequence design.
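The distinction between enforced and preferred positions can be sketched as follows. The function `sequence_penalties` is hypothetical and only counts mismatches; how antaRNA weights a soft mismatch in its quality score is not specified here, and hard violations are counted merely for illustration, since enforced positions would not normally be violated. Upper case letters are read as enforced IUPAC codes and lower case letters as soft preferences.

```python
# IUPAC nucleotide codes for RNA.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}

def sequence_penalties(seq, constraint):
    """Count (hard_violations, soft_mismatches) of seq against constraint.

    Upper case constraint letters are enforced IUPAC codes; lower case
    letters express a preference only (assumed reading of the notation).
    """
    if len(seq) != len(constraint):
        raise ValueError("sequence and constraint must have equal length")
    hard = soft = 0
    for nuc, code in zip(seq, constraint):
        if nuc not in IUPAC[code.upper()]:
            if code.islower():
                soft += 1   # preference missed: penalized, not forbidden
            else:
                hard += 1   # enforced position violated
    return hard, soft

if __name__ == "__main__":
    print(sequence_penalties("GGAU", "NcaU"))
```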
The fuzzy structure constraint, based on the existing implicit block constraint framework of antaRNA [28], allows the user to define regions of structural interaction (using lower case letters) in which no explicit structure is predefined. For instance, the structural constraint \(\mathbb {C}^{\text {str}} = \)‘(aaaaaa)’ is violated neither if a base pair is present in the a-block, e.g. ‘((....))’ or ‘(.(...))’, nor if no base pair is formed, i.e. ‘(......)’. So far, no penalty (structural distance) was applied if no base pair was formed within such a block. The new hard fuzzy structure constraint framework (encoded by upper case letters) penalizes the ‘no base pair’ case if it is found within a solution: the structural distance is increased by the equivalent of one missing explicit base pair for each upper case block that shows no base pair. Therefore, at least one base pair has to be formed within each hard fuzzy structure constraint block to avoid this penalty. Hard fuzzy blocks thus add a more imperative form of fuzziness to the structure constraint definition within antaRNA.
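The penalty rule for hard fuzzy blocks can be sketched in a few lines. This is an illustrative reading, not antaRNA's code: the names `base_pairs` and `hard_fuzzy_penalty` are hypothetical, a block is assumed to consist of all positions carrying the same letter, a base pair is assumed to count for a block when both of its positions carry that letter, and the predicted structure is assumed to be a balanced dot-bracket string.

```python
OPEN = "([{<"
CLOSE = ")]}>"

def base_pairs(structure):
    """Base pairs (i, j) of a balanced dot-bracket string, one stack per type."""
    stacks = {o: [] for o in OPEN}
    pairs = []
    for i, ch in enumerate(structure):
        if ch in OPEN:
            stacks[ch].append(i)
        elif ch in CLOSE:
            pairs.append((stacks[OPEN[CLOSE.index(ch)]].pop(), i))
    return pairs

def hard_fuzzy_penalty(constraint, predicted):
    """Number of upper case blocks in constraint without any base pair.

    Each such block adds the equivalent of one missing base pair to the
    structural distance; lower case blocks never add a penalty.
    """
    hard_blocks = {c for c in constraint if c.isalpha() and c.isupper()}
    # Letters of blocks that contain at least one predicted base pair.
    satisfied = {
        constraint[i] for i, j in base_pairs(predicted)
        if constraint[i] == constraint[j]
    }
    return len(hard_blocks - satisfied)

if __name__ == "__main__":
    for pred in ("(......)", "((....))", "(.(...))"):
        print(pred, hard_fuzzy_penalty("(AAAAAA)", pred))
```

With the constraint ‘(AAAAAA)’, the prediction ‘(......)’ incurs a penalty of one, whereas ‘((....))’ and ‘(.(...))’ incur none; the soft variant ‘(aaaaaa)’ incurs none in all three cases.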