Activation maximization design methods
In Fig. 1 and throughout the paper, we compare four different activation maximization methods for sequences: (1) Fast SeqProp (our method)—The modified activation maximization method which combines the logit normalization scheme of Eqs. 7–8 with the softmax straight-through estimator of Eqs. 5–6, (2) PWM—The original method with continuous softmax-relaxed inputs [44], (3) SeqProp—The categorical sampling method described in [29] using the (non-normalized) softmax straight-through gradient estimator, and (4) Fast PWM—A logit-normalized version of the softmax-relaxed method.
Starting with a randomly initialized logit matrix \(\varvec{l}\), for the methods PWM and Fast PWM we optimize \(\varvec{l}\) using the softmax relaxation \(\sigma (\varvec{l})\) from Eq. 3. For SeqProp and Fast SeqProp, we optimize \(\varvec{l}\) using the discrete nucleotide sampler \(\delta (\varvec{l})\) from Eq. 5. We define the optimization loss (or the ’train’ loss) as:
$$\begin{aligned} \mathcal {L}_{\text {train}}(\varvec{l}) = - \mathcal {P}(\varvec{x}(\varvec{l})) \end{aligned}$$
For PWM and Fast PWM, \(\varvec{x}(\varvec{l}) = \sigma (\varvec{l})\). For SeqProp and Fast SeqProp, \(\varvec{x}(\varvec{l}) = \delta (\varvec{l})\).
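For concreteness, the two input parameterizations can be sketched in NumPy (an illustrative sketch, not the paper's implementation; the sequence length, the 4-letter alphabet and all variable names are assumptions):

```python
import numpy as np

def softmax(l, axis=-1):
    """Softmax relaxation sigma(l): each row of l becomes a nucleotide distribution."""
    e = np.exp(l - l.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sample_one_hot(l, rng):
    """Discrete sampler delta(l): draw one nucleotide per position from sigma(l)."""
    p = softmax(l)
    n_pos, n_chan = p.shape
    x = np.zeros_like(p)
    for i in range(n_pos):
        j = rng.choice(n_chan, p=p[i])
        x[i, j] = 1.0
    return x

rng = np.random.default_rng(0)
l = rng.standard_normal((5, 4))          # 5 positions, 4 nucleotides (A, C, G, T)
x_relaxed = softmax(l)                   # continuous input used by PWM / Fast PWM
x_discrete = sample_one_hot(l, rng)      # one-hot input used by SeqProp / Fast SeqProp
```

Both parameterizations are optimized over the same logits \(\varvec{l}\); only the forward representation \(\varvec{x}(\varvec{l})\) differs.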
For Fast SeqProp we use the scaled, normalized logits \(\varvec{l}^{(\text {scaled})}\) (Eqs. 7–8) as parameters for the sampler \(\delta\) defined in Eq. 5. As such, we minimize the above loss with respect to \(\varvec{l}_{ij}\), \(\gamma _{j}\) and \(\beta _{j}\) (or \(\gamma\) and \(\beta\) for proteins). Using the softmax ST estimator from Eq. 6, we arrive at the following gradients for Fast SeqProp:
$$\begin{aligned} \frac{\partial \mathcal {P}(\delta (\varvec{l}^{(\text {scaled})}))}{\partial \varvec{l}_{ij}}&= \sum _{k=1}^{M} \frac{\partial \mathcal {P}(\delta (\varvec{l}^{(\text {scaled})}))}{\partial \delta (\varvec{l}^{(\text {scaled})})_{ik}} \cdot \frac{\partial \sigma (\varvec{l}^{(\text {scaled})})_{ik}}{\partial \varvec{l}_{ij}^{(\text {scaled})}} \cdot \gamma _{j} \cdot \frac{\partial \varvec{l}_{ij}^{(\text {norm})}}{\partial \varvec{l}_{ij}} \end{aligned}$$
(10)
$$\begin{aligned} \frac{\partial \mathcal {P}(\delta (\varvec{l}^{(\text {scaled})}))}{\partial \gamma _{j}}&= \sum _{i=1}^{N} \sum _{k=1}^{M} \frac{\partial \mathcal {P}(\delta (\varvec{l}^{(\text {scaled})}))}{\partial \delta (\varvec{l}^{(\text {scaled})})_{ik}} \cdot \frac{\partial \sigma (\varvec{l}^{(\text {scaled})})_{ik}}{\partial \varvec{l}_{ij}^{(\text {scaled})}} \cdot \varvec{l}_{ij}^{(\text {norm})} \end{aligned}$$
(11)
$$\begin{aligned} \frac{\partial \mathcal {P}(\delta (\varvec{l}^{(\text {scaled})}))}{\partial \beta _{j}}&= \sum _{i=1}^{N} \sum _{k=1}^{M} \frac{\partial \mathcal {P}(\delta (\varvec{l}^{(\text {scaled})}))}{\partial \delta (\varvec{l}^{(\text {scaled})})_{ik}} \cdot \frac{\partial \sigma (\varvec{l}^{(\text {scaled})})_{ik}}{\partial \varvec{l}_{ij}^{(\text {scaled})}} \end{aligned}$$
(12)
The gradient equations are very similar for Fast PWM (the logit-normalized PWM method); the only difference is that the discrete sampler \(\delta\) in the forward pass is replaced by the standard softmax \(\sigma\). Similar design methods were published in parallel with (or shortly after) this work, including an editing method based on the Gumbel-Softmax distribution [45] and other algorithms based on discretized activation maximization [46, 65]. See Figure S1E in the Additional file 1 for a comparison to optimization based on Gumbel-Softmax.
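Since Eqs. 6–8 are only referenced here, the following NumPy sketch assumes the standard instance-normalization form for Eqs. 7–8 and illustrates the softmax ST backward rule that appears in the gradients above; the function names and the exact normalization constant are assumptions, not quoted from the paper:

```python
import numpy as np

def softmax(l):
    e = np.exp(l - l.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def normalize_scale_logits(l, gamma, beta, eps=1e-6):
    # Per-channel instance normalization over positions (assumed form of Eqs. 7-8):
    # l_norm = (l - mean_j) / std_j, then l_scaled = gamma_j * l_norm + beta_j
    mu = l.mean(axis=0, keepdims=True)
    sd = l.std(axis=0, keepdims=True)
    return gamma * (l - mu) / (sd + eps) + beta

def softmax_st_grad(d_fitness_d_sample, l_scaled):
    # Softmax ST backward rule (Eq. 6): the gradient w.r.t. the discrete sample
    # is pushed through the softmax Jacobian, diag(p) - p p^T, per position.
    p = softmax(l_scaled)
    inner = (d_fitness_d_sample * p).sum(axis=-1, keepdims=True)
    return p * (d_fitness_d_sample - inner)
```

For a loss that is linear in the relaxed input, this backward rule coincides exactly with the true softmax gradient, which is why the ST estimator is consistent in expectation.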
The actual loss (or the ’test’ loss) is evaluated on the basis of discrete sequence samples drawn from the optimized softmax representation \(\sigma (\varvec{l})\), regardless of design method. In all four methods, we can use the categorical nucleotide sampler \(\delta (\varvec{l})\) to draw sequence samples and compute the mean test loss as:
$$\begin{aligned} \mathcal {L}_{\text {test}}(\{\varvec{l}^{(k)}\}_{k=1}^{K}) = - \frac{1}{K} \frac{1}{S} \sum _{k=1}^{K} \sum _{s=1}^{S} \mathcal {P}(\delta (\varvec{l}^{(k)})^{(s)}) \end{aligned}$$
Here S refers to the number of samples drawn from each softmax sequence \(\sigma (\varvec{l}^{(k)})\) at every weight update t, and K is the number of independent optimization runs. In all experiments, we set \(K = 10\) and \(S = 10\).
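A minimal sketch of this evaluation procedure, with a stand-in fitness function in place of the oracle \(\mathcal {P}\) (all names are illustrative):

```python
import numpy as np

def softmax(l):
    e = np.exp(l - l.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_test_loss(logit_runs, fitness, n_samples, rng):
    """L_test: average fitness over S one-hot samples drawn from each of the
    K optimized softmax sequences; `fitness` stands in for the oracle P."""
    total = 0.0
    for l in logit_runs:                       # K independent optimization runs
        p = softmax(l)
        for _ in range(n_samples):             # S samples per run
            x = np.zeros_like(p)
            for i in range(p.shape[0]):
                x[i, rng.choice(p.shape[1], p=p[i])] = 1.0
            total += fitness(x)
    return -total / (len(logit_runs) * n_samples)
```

With \(K = S = 10\), this mirrors the setting used in the experiments.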
In addition to gradient-based methods, we compare Fast SeqProp to discrete search algorithms. The first method is a pairwise nucleotide-swapping search (Evolution) [25], where sequence \(\varvec{x}\) is mutated with either one or, with 50% chance, two random substitutions at each iteration, resulting in a new candidate sequence \(\varvec{x}'\). \(\varvec{x}'\) is only accepted if \(\mathcal {P}(\varvec{x}') > \mathcal {P}(\varvec{x})\). We also tested a well-known metaheuristic—Simulated Annealing [66]—which has recently been demonstrated for sequence-level protein design [3]. In Simulated Annealing, mutations that lower fitness are still accepted with probability \(P(\varvec{x}', \varvec{x}, T)\), where T is a temperature parameter. Here we use the Metropolis acceptance criterion [67]:
$$\begin{aligned} P(\varvec{x}', \varvec{x}, T) = e^{-\left( \mathcal {P}(\varvec{x}) - \mathcal {P}(\varvec{x}')\right) / T} \end{aligned}$$
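The annealing baseline can be sketched as follows; this is a sketch under stated assumptions (a linear cooling schedule and the 1-or-2 substitution proposal described above; the paper's exact schedule is not quoted here):

```python
import numpy as np

def anneal(fitness, seq, n_iters, t0, rng, alphabet_size=4):
    """Simulated Annealing over one-hot sequences with the Metropolis
    criterion; `fitness` stands in for the oracle P."""
    x = seq.copy()
    best = x.copy()
    for it in range(n_iters):
        t = t0 * (1.0 - it / n_iters) + 1e-9          # assumed linear cooling
        x_new = x.copy()
        n_mut = 1 if rng.random() < 0.5 else 2        # 1 or 2 random substitutions
        for _ in range(n_mut):
            i = rng.integers(x.shape[0])
            x_new[i] = 0.0
            x_new[i, rng.integers(alphabet_size)] = 1.0
        d = fitness(x_new) - fitness(x)
        # Metropolis: always accept improvements; accept worse moves
        # with probability exp(-(P(x) - P(x')) / T)
        if d > 0 or rng.random() < np.exp(d / t):
            x = x_new
        if fitness(x) > fitness(best):
            best = x.copy()
    return best
```

Setting the acceptance probability to zero for worse moves recovers the pure Evolution search.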
Adaptive sampling temperature with fast SeqProp
In Fast SeqProp, the scaling parameter \(\gamma _{j}\) adaptively adjusts the sampling entropy to control global versus local optimization. This can be deduced from the gradient components of \(\gamma _{j}\) in Eq. 11:

1. \(\frac{\partial \mathcal {P}(\delta (\varvec{l}^{(\text {scaled})}))}{\partial \delta (\varvec{l}^{(\text {scaled})})_{ik}}\) is positive for nucleotides which increase fitness and negative otherwise.

2. \(\frac{\partial \sigma (\varvec{l}^{(\text {scaled})})_{ik}}{\partial \varvec{l}_{ij}^{(\text {scaled})}}\) is positive when \(j=k\) and negative otherwise.

3. \(\varvec{l}_{ij}^{(\text {norm})}\) is positive only when we are likely to sample the corresponding nucleotide.
Here, the product of the first two terms, \(\frac{\partial \mathcal {P}(\delta (\varvec{l}^{(\text {scaled})}))}{\partial \delta (\varvec{l}^{(\text {scaled})})_{ik}} \cdot \frac{\partial \sigma (\varvec{l}^{(\text {scaled})})_{ik}}{\partial \varvec{l}_{ij}^{(\text {scaled})}}\), is positive if \(j = k\) and nucleotide j raises fitness or if \(j \ne k\) and nucleotide k lowers fitness. Put together, the gradient for \(\gamma _{j}\) increases when our confidence \(\varvec{l}_{ij}^{(\text {norm})}\) in nucleotide j is consistent with its impact on fitness, such that \(\text {sign}\left( \sum _{k=1}^{M} \frac{\partial \mathcal {P}(\delta (\varvec{l}^{(\text {scaled})}))}{\partial \delta (\varvec{l}^{(\text {scaled})})_{ik}} \cdot \frac{\partial \sigma (\varvec{l}^{(\text {scaled})})_{ik}}{\partial \varvec{l}_{ij}^{(\text {scaled})}}\right) = \text {sign}\left( \varvec{l}_{ij}^{(\text {norm})}\right)\). Conversely, inconsistent nucleotides decrement the gradient. At the start of optimization, \(\gamma _{j}\) is small, leading to high PWM entropy and large jumps in sequence design space. As we sample consistent nucleotides and the entropy gradient \(\frac{\partial \mathcal {P}(\delta (\varvec{l}^{(\text {scaled})}))}{\partial \gamma _{j}}\) turns positive, \(\gamma _{j}\) increases. Larger \(\gamma _{j}\) lowers the entropy and leads to more localized optimization. However, if we sample sufficiently many inconsistent nucleotides, the gradient of \(\gamma _{j}\) may turn negative, again raising entropy and promoting global exploration.
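The claim that \(\gamma\) acts as an inverse sampling temperature can be checked numerically; this toy example (not from the paper) shows that scaling fixed normalized logits by a larger \(\gamma\) lowers the mean per-position softmax entropy:

```python
import numpy as np

def softmax(l):
    e = np.exp(l - l.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    """Mean per-position Shannon entropy of a softmax sequence."""
    return -(p * np.log(p + 1e-12)).sum(axis=-1).mean()

rng = np.random.default_rng(1)
l_norm = rng.standard_normal((20, 4))   # normalized logits, 20 positions

# Larger gamma sharpens each per-position softmax, lowering sampling entropy;
# smaller gamma flattens it toward uniform, promoting global exploration.
ents = [entropy(softmax(g * l_norm)) for g in (0.1, 1.0, 10.0)]
```

Because \(\gamma _{j}\) is itself updated by gradient descent, this temperature is adapted automatically during optimization rather than following a fixed schedule.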
Note that, in the context of protein design where we have a single scale \(\gamma\) and offset \(\beta\), the gradient expressions from Eqs. 11 and 12 are additively pooled across all M channels. The argued benefits of instance normalization above thus hold true for layer normalization as well.
VAE-regularized fast SeqProp
In the main paper (Fig. 3), we use a variational autoencoder (VAE) [59] to regularize the sequence design when running Fast SeqProp. Similar regularization techniques based on VAEs have previously been employed by [37, 39]. The original optimization objective (Eq. 1) is extended by passing the sampled one-hot pattern \(\delta (\varvec{l})\) to the VAE and estimating its marginal likelihood, \(p_{\text {VAE}}(\delta (\varvec{l}))\), using importance-weighted inference. We then minimize a margin loss with respect to the mean likelihood \(p_{\text {ref}}\) of the original training data to keep sequence designs in-distribution, using the softmax ST estimator to propagate gradients back to \(\varvec{l}\):
$$\begin{aligned} \min _{\varvec{l}} - \mathcal {P}(\delta (\varvec{l})) + \lambda \cdot \text {max}\big [\log _{10} p_{\text {ref}} - \log _{10} p_{\text {VAE}}(\delta (\varvec{l})) - \rho , 0 \big ] \end{aligned}$$
(13)
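The margin term of Eq. 13 is a simple hinge on log-likelihoods; a minimal sketch with illustrative names:

```python
def vae_margin_penalty(log10_p_vae, log10_p_ref, rho, lam):
    """Margin likelihood penalty from Eq. 13: zero whenever the design's
    VAE log10-likelihood is within rho of the training-data reference,
    and growing linearly as the design drifts further out-of-distribution."""
    return lam * max(log10_p_ref - log10_p_vae - rho, 0.0)
```

The hinge means well-supported designs incur no penalty at all, so the regularizer only activates once a design leaves the VAE's learned data manifold by more than the margin \(\rho\).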
VAE-regularized fast SeqProp with uncertainty estimation
In the Additional file 1 (Figure S3E), we replicate the benchmark comparison of the main paper (Fig. 3), but we use oracle predictors capable of estimating the uncertainty in their fitness predictions to further regularize the designs [64]. Sequence design based on uncertainty estimators was originally proposed by [39, 68]. Assume that the oracle model predicts the mean \(\mu \big [\delta (\varvec{l})\big ]\) and standard deviation \(\epsilon \big [\delta (\varvec{l})\big ]\) of fitness scores for the designed (sampled) pattern \(\delta (\varvec{l})\). We then use the (differentiable) survival function of the normal distribution to maximize the probability \(p_{\mu [\delta (\varvec{l})], \epsilon [\delta (\varvec{l})]}(\mathbbm {Y} > q)\) that the predicted fitness of sequence \(\delta (\varvec{l})\) is larger than quantile q of the training data:
$$\begin{aligned} \min _{\varvec{l}} - \log _{10} p_{\mu [\delta (\varvec{l})], \epsilon [\delta (\varvec{l})]}(\mathbbm {Y} > q) + \lambda \cdot \text {max}\big [\log _{10} p_{\text {ref}} - \log _{10} p_{\text {VAE}}(\delta (\varvec{l})) - \rho , 0 \big ] \end{aligned}$$
(14)
This fitness objective is known as ’Probability of Improvement’ (PI) [69,70,71].
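The PI term has a closed form via the normal survival function; a minimal sketch using `math.erfc` (variable names are illustrative):

```python
import math

def log10_prob_improvement(mu, sigma, q):
    """Probability-of-Improvement term of Eq. 14: log10 of the normal
    survival function P(Y > q) under the oracle's predicted mean/std."""
    sf = 0.5 * math.erfc((q - mu) / (sigma * math.sqrt(2.0)))
    return math.log10(sf)

# The fitness part of the design loss is then -log10 P(Y > q):
loss_term = -log10_prob_improvement(mu=1.0, sigma=0.5, q=0.0)
```

Because the survival function is differentiable in both \(\mu\) and \(\epsilon\), gradients flow through the oracle's uncertainty estimate as well as its mean prediction.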
VAE-regularized fast SeqProp with activity regularization
In the Additional file 1 (Figure S3F), we use the predictor MPRA-DragoNN to design maximally HepG2-specific enhancer sequences, and use activity regularization on some of the internal layers of the predictor to regularize the optimization. We maximize the predicted fitness score \(\mathcal {P}(\delta (\varvec{l}))\) (and minimize the VAE loss as before) while also minimizing a margin loss applied to the sum of a subset of convolutional activation maps \(\mathcal {C}_{k}(\delta (\varvec{l}))\):
$$\begin{aligned} \begin{aligned} \min _{\varvec{l}}& - \mathcal {P}(\delta (\varvec{l})) + \lambda \cdot \text {max}\big [\log _{10} p_{\text {ref}} - \log _{10} p_{\text {VAE}}(\delta (\varvec{l})) - \rho , 0 \big ]\\&+ \eta _{1} \cdot \text {max}\big [\mathcal {C}_{1}(\delta (\varvec{l})) - C_{1}, 0 \big ] + ... + \eta _{K} \cdot \text {max}\big [\mathcal {C}_{K}(\delta (\varvec{l})) - C_{K}, 0 \big ] \end{aligned} \end{aligned}$$
(15)
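The activity-regularization terms of Eq. 15 are again hinge penalties, one per selected activation map; a minimal sketch (the caps \(C_{k}\) and weights \(\eta _{k}\) are illustrative placeholders, not values from the paper):

```python
def activity_penalty(act_sums, caps, etas):
    """Sum of hinge penalties from Eq. 15: each selected conv map's summed
    activation act_sums[k] is penalized only above its cap caps[k],
    weighted by etas[k]."""
    return sum(eta * max(a - c, 0.0) for a, c, eta in zip(act_sums, caps, etas))
```

As with the VAE margin, the hinges leave the fitness objective untouched until an internal activation exceeds its cap, discouraging designs that over-excite individual filters.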
Predictor models
We designed sequences for five distinct DNA or RNA deep learning predictors. For each of these models, we defined one of their (potentially many) outputs as the classification or regression score \(\mathcal {P}(\varvec{x}) \in \mathbb {R}\) to maximize in Eq. 1. We also designed protein sequences according to a 3D protein structure predictor. Here is a brief description of each fitness predictor:
DragoNN Predicts the probability of SPI1 transcription factor (TF) binding within a 1000-nt sequence. We define \(\mathcal {P}(\varvec{x})\) as the logit score of the network output. The trained model was downloaded from:^{Footnote 1}.
DeepSEA [22] Predicts multiple TF binding probabilities and chromatin modifications in a 1000-nt sequence. We define \(\mathcal {P}(\varvec{x})\) as the logit score of the CTCF (Dnd41) output. The trained model was downloaded from:^{Footnote 2}.
APARENT [29] Predicts proximal alternative polyadenylation isoform abundance in a 206-nt sequence. We define \(\mathcal {P}(\varvec{x})\) as the logit score of the network output. The trained model was downloaded from:^{Footnote 3}.
MPRA-DragoNN [24] Predicts transcriptional activity of a 145-nt promoter sequence. We define \(\mathcal {P}(\varvec{x})\) as the sixth output (SV40) of the ’Deep Factorized’ model. The trained model was downloaded from:^{Footnote 4}.
Optimus 5’ [25] Predicts mean ribosome load in a 50-nt sequence. \(\mathcal {P}(\varvec{x})\) is the (non-scaled) output of the ’evolution’ model. The trained model was downloaded from:^{Footnote 5}.
trRosetta [34] Predicts amino acid residue distance distributions and angle distributions of the input primary sequence. We defined the optimization objective as minimizing the mean KLdivergence between the predicted distance and angle distributions of the designed sequence compared to a target structure (see the definition in Section ’Protein Structure Optimization’ of the main paper). The trained model was downloaded from:^{Footnote 6}.
All optimization experiments were carried out in Keras (Chollet, 2015) using Adam with default parameters [72]. Some predictor models were ported using pytorch2keras.
Validation models
When designing sequences for the predictor models listed in the previous section, we computed validation scores based on the following held-out models (i.e. models we did not explicitly optimize for):
DeeReCT-APA [31] Predicts relative isoform abundances for multiple competing polyadenylation signals. The model was trained on mouse 3’ sequencing data. We used the model to score a particular designed polyadenylation signal by predicting its relative use when competing with a strong, fixed distal polyadenylation signal. The model was trained using the code repository at:^{Footnote 7}.
DeepPASTA [30] Predicts relative isoform abundance of two competing polyadenylation signals. Several model versions exist; we used the one trained on human brain tissue 3’ sequencing data. To score a particular designed polyadenylation signal, we predicted its relative use when competing with a strong, fixed distal signal. The trained model was downloaded from:^{Footnote 8}.
iEnhancer-ECNN [62] Detects genomic enhancer regions and predicts whether a detected region is a weak or strong enhancer. We used the product of these two probability outputs to score each designed enhancer sequence. The model was trained using the code repository at:^{Footnote 9}.
EnhancerP-2L [63] Detects genomic enhancer regions and predicts whether a detected region is a weak or strong enhancer. For a sample of generated sequences per design method, we calculated the mean detect/not-detect prediction rate, the mean weak/strong prediction rate and the mean p-score. The model was available via a web application at:^{Footnote 10}.
Retrained Optimus 5’ [25] A retrained version of Optimus 5’, where the training data had been complemented with extreme sequences (such as long single-nucleotide repeats, etc.). The trained model was downloaded from:^{Footnote 11}.
Auxiliary models
In Fig. 3, we trained a variational autoencoder (VAE) [59] and a generative adversarial network (a WGAN) [60] on a subset of the data that was originally used to train each of the predictor oracles APARENT, MPRA-DragoNN and Optimus 5’. For each design task, we selected a sample of 5000 sequences with the highest observed fitness and a sample of 5000 randomly selected sequences. The VAE, which was based on a residual network architecture [73], was trained on the high-fitness subset of sequences. The WGAN, which was based on the architecture of Gupta et al. [38], was trained on the random subset of sequences.
Other design methods
A selection of design methods were used for benchmark comparisons in Fig. 3. Here we describe how they were executed and what parameter settings were used:
CbAS [39] The procedure was started from the VAE which had been pretrained on the high-fitness dataset. It was executed for 150 rounds and, depending on design task, either 100 or 1000 sequences were sampled and used for weighted retraining at the end of each round (whichever resulted in higher fitness scores). The threshold was set to either the 60th or 80th percentile of fitness scores predicted on the training data (whichever resulted in more stable fitness score trajectories). The VAE was trained for either 1 or 10 epochs at the end of each round (whichever resulted in more stable fitness scores—for some tasks, the fitness scores would drop abruptly after only a few sampling rounds when training the VAE for 10 epochs per round). For the benchmark comparison in the main paper, the standard deviation of the predictions was set to a small constant value ranging between 0.02 and 0.1, depending on application (since none of the pretrained oracles APARENT, MPRA-DragoNN or Optimus 5’ predicts a standard deviation, we used a small constant deviation that was \(\sim 50\)x smaller than the maximum possible predicted value). In the Additional file 1, where we use oracles with uncertainty estimation, we also supplied the predicted standard deviation to the CbAS survival function. The code was adapted from:^{Footnote 12}.
RWR [61] The procedure was started from the VAE which had been pretrained on the high-fitness dataset. It was executed for 150 rounds and 100 or 1000 sequence samples were used for weighted retraining at the end of each round (whichever resulted in higher fitness scores). The VAE was trained for 10 epochs each round. The code was adapted from:^{Footnote 13}.
AM-VAE [44] This method performs activation maximization by gradient ascent through a pretrained VAE to design sequences. The procedure was started from the VAE which had been pretrained on the high-fitness dataset. Each sequence was optimized for 2000–5000 updates depending on design task (using the Adam optimizer). A normally distributed noise term was added to the gradients to help overcome potential local minima. The code was adapted from:^{Footnote 14}.
FBGAN [38] The FBGAN procedure was started from the WGAN which had been pretrained on a random sample of sequences. The method was executed for 150 epochs and 960 sequences were sampled and used for feedback at the end of each epoch. We either set the feedback threshold to a fixed value (the 80th percentile of fitness scores predicted on the high-fitness dataset), or we adaptively reset the threshold to a certain percentile as measured on the 960 sampled sequences at the end of each epoch. The code was adapted from:^{Footnote 15}.
FBVAE [38] A VAE-based version of the FBGAN. The procedure was started from the VAE which had been pretrained on the high-fitness dataset. It was executed for 150 epochs and 100 or 1000 sequence samples were used for feedback at the end of each epoch (whichever resulted in higher fitness scores). A fixed threshold was used (either the 60th or 80th percentile as predicted on the high-fitness data). The code was adapted from:^{Footnote 16}.
Graph tools
All graphs were made with Matplotlib [74].