The inverse folding is implemented by a fairly standard genetic algorithm (GA) approach
[15], expanding a population by mutations and recombinations and selecting the most fit individuals for propagation to the next generation. The main deviation from a completely generic GA is that the method is aware of the aim of designing sequences folding into one or more target structures. Rather than search the full sequence space, we direct the search by ensuring all sequences can fold into the target structure(s) forming only canonical base pairs. This can be viewed as a similar approach to the local search/adaptive walk on a hierarchical decomposition of the target structure implemented by some methods
[8, 10, 11, 16], except that the recombination operation chooses a random decomposition and assesses two complementary structural components in conjunction, rather than independently.

In addition to a random search, our method also implements several strategies for more directed evolution. Instead of making uninformed evolutionary changes and leave it to the selection part of the GA to direct the search, it is also possible to bias the choice of change towards e.g. mutating positions with a predicted structure that does not match the target, or in a recombination towards parts that have a good match to the target structure in the regions where they contribute sequence. These strategies are all available through command line options (and as classes when the implementation is used as a library rather than stand-alone program) to allow experimentation and tailoring to specific applications. Through experimentation, a set of parameters was found which worked reasonably on all types of structure, which is set as a pre-defined default. In the following, we will use
to denote a target consisting of a non-empty set of structures, all of length *n*, and *s* to denote a sequence in the GA population.

### Positional fitness

A key concept in the GA is the fitness of positions. This allows directing evolutionary changes by choosing unfit positions for mutations – an approach also taken in
[8, 10] and to a lesser extent in
[3, 9, 11] – and regions with fit positions for recombination. Positional fitnesses take a value of 1 for maximally unfit positions and a value of 0 for maximally fit positions.

We define positional fitness schemes relative to a single target structure *τ*. If
consists of multiple target structures, the final positional fitness is computed as the average positional fitness over all structures in
. Let *τ*
*i* denote the required structure for position *i* in *τ*, i.e. either forming a base pair with another position *j* or unpaired. Let *σ*denote the optimum structure predicted for *s* (i.e. the minimum free energy structure), and let
denote the set of possible structures for position *i*, i.e. base pairs with position *i* as one base in the pair and the unpaired configuration. Finally, let
be the marginal Boltzmann probability in the form of a mapping from positional structure to marginal probability. Our method provides the following choices for the fitness of position *i*, denoted *f*[*i*].

Scheme 1. Binary indicator of whether position has correct predicted structure, *f*[*i*]= 1 − *δ*
_{
σ [i],τ[i]} where *δ* is the Kronecker delta. This corresponds to the *μ*(*s*
*τ*) based objective in
[10], providing a binary view of whether the design matches at position *i* that may be too coarse grained. However, it does also capture *σ*is correct at position *i*, or the design needs to change.

Scheme 2. Boltzmann probability of target structure, *f*[*i*] = 1 − *p*(*τ*
*i*). This corresponds to the *n*(*s*
*τ*) objective in
[10], providing a measure of how likely a structure drawn from the Boltzmann ensemble is to match *τ*at position *i*. This provides more fine grained information than scheme 1, but measures proximity rather than match to *τ*.

Scheme 3. Truncated negative logarithm of Boltzmann probability of target structure, *f*[*i*] = min{−log(*p*(*τ*
*i*))/100,1}. This penalises decreasing probability to match *τ*with increasingly severity, and is particularly useful for multiple structure targets where the average corresponds to a sum of logarithms, thus penalising designs failing to match just one of the target structures. Truncation is performed to allow mapping to [0,1].

Scheme 4. Binary indicator of whether probability of target structure exceeds threshold
. This allows requiring all elements of *τ*to be present with a given probability when sampling the Boltzmann ensemble, but being oblivious to further improvements beyond that. It is particularly useful when
specifies a multi-stable design, i.e. a target with multiple structures for the same temperature, by measuring the number of target structures the design matches with probability at least *θ* at position *i*.

Scheme 5. Sigmoid transformed difference between Boltzmann probability of target structure and most probable alternative structure,
where
. This allows accepting lower probabilities of *τ*
*i*in regions where the structure is generally less well defined, although a position is still most fit iff *p*(*τ*
*i*) = 1. The transformation corresponds to
, normalised to yield values in [0,1], and causes changes to have larger effects when the probability of *τ*
*i*and the most probable alternative are close to being equal.

The *p* and *σ* are computed at the temperature specified for *τ*, and may thus differ between the *τ*’s when a multiple structure target is specified.

Finally, the following positional fitness schemes specifically designed for multiple structure targets
are defined.

Scheme 6. Minimum Boltzmann probability of target structures,
.

Scheme 7. Product of Boltzmann probabilities of target structure,
.

For single structure targets, they are equivalent to the Boltzmann scheme, scheme 2. For multiple structure targets, scheme 6 exclusively focuses on the worst fitness over all target structures, while scheme 7 includes Boltzmann probabilities from all target structures. However, by multiplying the probabilities, having a low fitness on just a single target structure will have a much more notable effect, than under the sum implicit in the averaging of scheme 2.

In addition to a single of the above schemes, there is the possibility of using a weighted combination of any subset of them to define positional fitness. E.g. combining the first two schemes would divide positions based on whether they have a predicted structure matching the target, but further graduate the fitness by the marginal Boltzmann probability of the target structure at each position.

The concept of positional fitness underpins most operations of the GA: mutation, recombination, selection, and termination. Whenever fitness of a region (for recombination cross over point selection) or the entire sequence (for selection and termination) is needed, this is obtained as the sum of the positional fitnesses in the region or sequence. Different positional fitness schemes can be used for these four aspects, with the limitation that negative logarithms of Boltzmann probabilities are only used for mutation, and product of Boltzmann probabilities cannot be used for mutation.

### Fitness and objective

Often fitness and objective of GAs are considered equivalent, but we make the distinction of using fitness for the selection in each round of the GA and objective for determining when an adequate solution has been found and the search can be terminated. In a standard design problem where the aim is to find a sequence folding to one specific target structure, it is natural to base the objective on whether positions are correct in the predicted structure and terminate when the number of errors reaches 0. However, a more fine grained selection may be desirable, for example substituting or combining the number of errors with scheme 2 – instead of choosing randomly between two sequences with e.g. 10% positions that are wrong in the predicted structure, we would prefer the one with higher probabilities of positions being correct.

A global, i.e. non-positional, scheme

Scheme 8. Logarithm of structure probabilities in Boltzmann ensemble and their variance:
where
and *x*
_{
τ
}= −log*p*
_{
τ
}is the negative logarithm of the probability of target structure *τ* in the Boltzmann ensemble of sequence *s*, and *ξ* ≥ 0 is the weight assigned to the contribution from the variance

based on the cost functions discussed in
[14] and corresponding to the *Π*(*s*
*τ*) objective in
[10], is also available for defining fitness and objective. This provides a means of requiring an exact match to the target structures, rather than basing the fitness on the distance to predicted or expected structure as in e.g. schemes 1 and 2. This is particularly relevant when designing for multiple structures at the same or very similar temperatures, where some of the position based schemes may be confused by designs where positions exhibit a good match to varying subsets of the target structures.

Finally, to maintain diversity in the GA population, the fitness can be augmented with a weighted contribution from the average Hamming distance to already selected sequences. If
denotes the set of sequences already selected in the selection stage at the end of a generation in the GA, each remaining candidate sequence *s* has its fitness augmented by
, where *h*(*s*,*t*) is the Hamming distance between sequences *s* and *t*, and *ζ* > 0 is the weight assigned to the contribution from diversity, before selecting the next individual carried forward.

### Mutation

The position targeted for mutation in a sequence is chosen either uniformly at random, or with probability proportional to positional fitnesses. Similarly, sequences can be chosen for mutation either equally many times, uniformly at random, or with probability proportional to the reciprocal of the sequence fitness (with sequences with fitness 0 given twice the probability of the otherwise most fit sequences).

When choosing a new nucleotide for a position chosen for mutation, we want to maintain *compatibility* with all target structures. That is, the modified sequence should fold into each target structure using only canonical Watson-Crick and GU wobble base pairs. When the target consists of a single structure, unpaired positions can be updated independent of the rest of the sequence, while base pairing positions can be updated by sampling a new base pair for the two positions independent of the rest of the sequence. However, with two or more target structures, dependencies can extend to more positions, as a position can be base paired to multiple other positions in the different target structures.

The *target dependency graph* (TDG) (denoted *dependency graph* in
[14]) implied by
is the graph on nodes {1,…,*n*} where two nodes *i, j* are connected by an edge iff
, where *i* · *j*denotes a base pairing of positions *i* and *j*. The compatibility of a position will depend, directly or indirectly, on all positions in the connected component it belongs to in the TDG, so the entire connected component may have to be updated in a mutation. It can be observed from (
[14], Theorems 1 & 2), that with canonical base pairing compatible sequences will exist iff the TDG is bipartite, and if
the TDG will be bipartite. If
, no compatible sequence may exist. For example, if three target structures contain base pairs *i*·*j*, *j*·*k*, and *i*·*k*, respectively, then there is no assignment of nucleotides to positions *i, j*, and *k* that will leave all base pairs canonical.

In
[14] formulas for sampling an assignment of nucleotides on a connected component when the maximum degree of any node is at most 2 is provided. However, we have chosen a simpler, heuristic update algorithm for two reasons. First, our method was developed to also cope with larger sets of target structures. Secondly, even for
the maximum node degree may be 3 and hence one may suspect assignment of nucleotides, i.e. colouring of the nodes, uniformly at random to be difficult – with a three letter nucleotide alphabet with base pairs allowed between any two non-identical nucleotides, the problem becomes *#*
**P** hard
[17]. Finally, it is unclear whether sampling uniformly from the set of compatible assignments is the best strategy. As G’s and U’s can pair with two other types of nucleotides, while C’s and A’s can pair with only one other type, the set of compatible assignments will be biased towards a high GU content.

The following forms our method for sampling nucleotides on a connected component in the TDG, starting with position *i*, ensuring all positions form canonical base pairs following the update.

Choose *σ*from {A,C,G,U}∖{*s*[*i*]} and set *s*[*i*] = *σ*

*F* = {*i*}, *N* = {*j*∣*i* · *j*}

**while**
*N* ≠ * ∅ *
**do**

Choose *j* ∈ *N* uniformly at random

Choose *σ* from
and set *s*[*j*] = *σ*

*F* = *F*∪{*j*}, *N* = *N*∪{*k*∣*j*·*k*}∖*F*

**end while**

where *j* · *k* denotes that two nodes are connected by an edge in the TDG and
is the set of nucleotides compatible with *σ*. It performs a traversal of the connected component of *i*, at each step choosing a random node neighbouring the already updated nodes. New nucleotides are drawn from a distribution which can be specified – if the default uniform distribution is used, this will tend to favour high GU content for the same reason as for choosing complete compatible assignments uniformly at random discussed above – truncated to the possibilities allowed. If the current nucleotide is among the choices, there is an option either always to keep the current nucleotide (to limit collateral effects of a mutation) or to bias the draw with 1 − *f* [*i*] for the current nucleotide and *f* [*i*] for alternatives (to allow the current positional fitness to affect the probability of a change).

### Recombination

Due to the hierarchical nature of RNA secondary structures, the GA uses recombination mimicking gene conversion rather than cross over, i.e. an infix of one sequence is recombined with the corresponding prefix and suffix of the other sequence. The easiest way to keep all base pairs canonical, is to always take two positions forming a base pair in a target structure from the same sequence. If we create a recombinant on sequences *s* and *t* from a crossover point *i*,*j* by forming the sequence *s*[1*..i*] *t* [*i* + 1*..j*] *s* [*j* + 1*..n*], no base pairing positions come from different sequences iff there are no base pairs in the target structure(s) for which one of *i* + 1 and *j* is inside the base pair and the other one is outside the base pair. This means that we can partition break points into sets of *pairwise permissible points*, such that the aim of taking base pairing positions from the same sequence is achieved iff crossover points are chosen as pairs of points from the same set in this partition. For single structure targets, these sets are exactly the loops of the structure, when stacking base pairs are viewed as internal loops of size 0, and the set of all external positions. The following outlines the procedure used for constructing the sets of pairwise permissible points, where sets of size 1 are discarded. The target is denoted by
, and when the algorithm terminates *C* is the set of non-singleton sets of points that are pairwise permissible.

*C*={{0,…,*n*}}

**for**
**do**

*C*
^{
′
}= *∅*

for *S* ∈ *C*
**do**

*C*
^{
′
}= *C*
^{
′
}∪

(*X* ∈ {* S*∩[*i*, *j*)[ ,*S*∩ ([0,*i*[∪[*j*,*n*])}:|*X*|>1

**endfor**

*C* = *C*
^{
′
}

**end for**

As a starting point, pairs of points are chosen by first choosing a set of pairwise permissible points with probability proportional to the set size, then choosing a pair from the set uniformly at random, ensuring an overall uniform probability that a point is chosen. This distribution can be biased proportional to

where *f*
_{
s
} and *f*
_{
t
} are positional fitnesses for *s* and *t* respectively. Similarly, the pair of sequences *s, t* can be chosen uniformly at random, based on individual fitness as described for mutations above, or based on the sum over all pairs *i, j* of permissible points of
to preferentially choose pairs of sequences complementing each other.

### Data

Data was taken from two main sources, to benchmark Frnakenstein and other inverse folding methods. The first data set used in our benchmarks is the data set used in
[12]. This was downloaded from the MODENA website. It consists of a structure from each of the 29 out of the first 30 families in Rfam
[19], with the tmRNA family (RF00023) left out due to a high content of pseudoknot forming base pairs. We refer to this data set throughtout as the *Rfam* data set.

Secondly, data was taken from RNASTRAND
[20], which itself takes data from many sources
[21–25]. The data was filtered so that the sequences and structures could ensure reliability of predictions. We removed identical sequences and disregarded synthetic data and sequences with ambiguous base pairs. Further, any sequences with greater than 80% base pair similarity with another structure in the data set were removed, as well as all sequences with pseudoknots, as RNAfold does not predict pseudoknots. The resulting data set consisted of 397 RNA molecules, containing 363 unique secondary structures with a total length of 55,025, which we refer to as the *RNASTRAND* data set.

However, with both data sets, it may be possible that there is no sequence which RNAfold will fold into the reference structure, and so the method might not be able to acheive 100% accuracy, due to RNAfold, not the search heuristic. Consequently the sequences corresponding to the structures in the RNASTRAND data set were re-folded using RNAfold, so there is known to be at least one sequence which will correctly fold. This dataset will be denoted as the *RNASTRAND-Refolded*, and consists of 383 unique structures with a total length of 56,606.