Fast activation maximization for molecular sequence design

Background: Optimization of DNA and protein sequences based on machine learning models is becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot coded sequences are first approximated by a continuous representation, which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, the current version of the method suffers from vanishing gradients and may cause predictor pathologies, leading to poor convergence.

Results: Here, we introduce Fast SeqProp, an improved activation maximization method that combines straight-through approximation with normalization across the parameters of the input sequence distribution. Fast SeqProp overcomes bottlenecks in earlier methods arising from input parameters becoming skewed during optimization. Compared to prior methods, Fast SeqProp results in up to 100-fold faster convergence while also finding improved fitness optima for many applications. We demonstrate Fast SeqProp's capabilities by designing DNA and protein sequences for six deep learning predictors, including a protein structure predictor.

Conclusions: Fast SeqProp offers a reliable and efficient method for general-purpose sequence optimization through a differentiable fitness predictor. As demonstrated on a variety of deep learning models, the method is widely applicable and can incorporate various regularization techniques to maintain confidence in the sequence designs. As a design tool, Fast SeqProp may aid in the development of novel molecules, drug therapies and vaccines.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04437-5.

A wide range of methods have previously been applied to computational sequence design, for example Genetic Algorithms (Deaton et al., 1996), Simulated Annealing (Hao et al., 2015), Bayesian optimization (Belanger et al., 2019) and population-based methods (Xiao et al., 2009; Ibrahim et al., 2011; Mustaza et al., 2011; Angermueller et al., 2020). More recently, methods combining adaptive sampling or other optimization techniques with deep generative networks have been used to model distributions of sequences with desired functional properties (Gómez-Bombarelli et al., 2018; Gupta et al., 2019; Brookes et al., 2019; Yang et al., 2019; Riesselman et al., 2019; Costello et al., 2019; Linder et al., 2020), including Deep Exploration Networks (DENs), developed in earlier work by our group. While powerful, these methods first require selecting an appropriate generative network and tuning several hyper-parameters.
Perhaps the simplest approach to sequence design based on a differentiable fitness predictor is to optimize the input pattern by gradient ascent (Lanchantin et al., 2016; Killoran et al., 2017; Gómez-Bombarelli et al., 2018; Bogard et al., 2019). The method, commonly referred to as activation maximization, uses the gradient of the neural network output to make incremental changes to the input. While simple, activation maximization cannot be directly applied to sequences, as they are discrete and not amenable to gradient ascent. Several extensions have been proposed to rectify this; Killoran et al. (2017) used a softmax layer to turn the sequences into continuous relaxations, and in previous work we developed SeqProp, which uses straight-through gradients to optimize discrete samples (Bogard et al., 2019). However, as our results below indicate, both methods converge slowly. Furthermore, continuous input relaxations may cause predictor pathologies, leading to poor designs.
Here, we combine discrete nucleotide sampling and straight-through approximation with normalization across the parameters (nucleotide logits) of the sampling distributions, resulting in a markedly improved sequence design method which we refer to as Fast SeqProp. We apply Fast SeqProp to a range of DNA and protein design tasks, including the design of strong enhancers, 5' UTRs, alternative polyadenylation signals and protein structures. We demonstrate up to a 100-fold optimization speedup, and improved optima, for these design tasks compared to prior methods based on activation maximization. We independently validate designed sequences by scoring them with models not used in the optimization procedure. We further show that our method can outperform global search heuristics such as Simulated Annealing as well as more recent sampling-based methods built on generative models. Unlike the latter approaches, Fast SeqProp does not require training of an independent generator. It is thus model- and parameter-free, making it easy to use when designing smaller sequence sets. Moreover, Fast SeqProp can incorporate many different regularization techniques to design sequences that are not too distant from the original training data and thus maintain confidence, such as regularization based on a variational autoencoder (VAE) and optimization of probabilistic predictor models capable of estimating uncertainty.

Background
Given a sequence-predictive neural network P and an initial input pattern x^(0), the gradient ascent method seeks to maximize the predicted fitness P(x) ∈ R by tuning the input pattern x:

x* = argmax_x P(x)    (1)

Assuming P is differentiable, we can compute the gradient ∇_x P(x) with respect to the input and optimize x by updating the variable in the direction of the fitness gradient (Simonyan et al., 2013):

x^(t+1) = x^(t) + η · ∇_x P(x^(t))    (2)

However, sequences are usually represented as one-hot coded patterns (x ∈ {0, 1}^(N×M), where N is the sequence length and M the number of channels; M = 4 for nucleic acids and M = 20 for proteins), and discrete variables cannot be optimized by gradient ascent. Several different reparameterizations of x have been proposed to bypass this issue. In one of the earliest implementations, Lanchantin et al. (2016) represented the sequence as an unstructured, real-valued pattern (x ∈ R^(N×M)) but imposed an L2-penalty on x in order to keep it from growing too large and causing predictor pathologies. After optimization, this real-valued pattern is interpreted as a sequence logo from which samples can be drawn. However, the method was introduced mainly as a visualization tool rather than a sequence design approach. Killoran et al. (2017) later introduced a softmax reparameterization, turning x into a continuous relaxation σ(l):

σ(l)_ij = e^(l_ij) / Σ_k e^(l_ik)    (3)

Here l_ij ∈ R are differentiable nucleotide logits. The gradient of σ(l) with respect to l is defined as:

∂σ(l)_ij / ∂l_ik = σ(l)_ij · (1[j = k] − σ(l)_ik)    (4)

Given Equations 3 and 4, we can maximize P(σ(l)) with respect to the logits l using the gradient ∇_l P(σ(l)). While elegant, there are two issues with this architecture. First, the gradient in Equation 4 becomes vanishingly small for large values of l_ij (when σ(l)_ik ≈ 0 or σ(l)_ij ≈ 1), halting convergence. Second, sequence-predictive neural networks have only been trained on discrete one-hot coded patterns, and the predictive power of P may be poor on a continuous relaxation such as σ(l).
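The softmax relaxation and its gradient (Equations 3 and 4) can be sketched in a few lines of NumPy. The example below also illustrates the vanishing-gradient problem discussed above: once any logit grows large, every entry of the softmax Jacobian collapses toward zero. This is a minimal sketch; the function names are ours, not from the paper's code.

```python
import numpy as np

def softmax(l):
    """Row-wise softmax over the channel axis: sigma(l)_ij = exp(l_ij) / sum_k exp(l_ik)."""
    e = np.exp(l - l.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def softmax_jacobian(l):
    """d sigma(l)_ij / d l_ik = sigma_ij * (1[j==k] - sigma_ik), per position i."""
    s = softmax(l)                      # (N, M)
    eye = np.eye(l.shape[-1])           # (M, M)
    return s[:, :, None] * (eye[None] - s[:, None, :])  # (N, M, M)

# Moderate logits yield informative gradients; large logits yield vanishing ones.
small = softmax_jacobian(np.array([[0.1, -0.1, 0.0, 0.2]]))
large = softmax_jacobian(np.array([[10.0, -10.0, -10.0, -10.0]]))
```

With near-uniform logits the Jacobian entries are on the order of 0.2, while for the saturated position they are on the order of 1e-9, which is why optimization stalls once logits drift.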
Following advances in gradient estimators for discretized neurons (Bengio, Léonard and Courville, 2013; Courbariaux et al., 2016), we developed SeqProp, a version of activation maximization that replaces the softmax transform σ with a discrete, stochastic sampler δ (Bogard et al., 2019):

δ(l)_ij = 1[Z_i = j], where Z_i ∼ Categorical(σ(l)_i)    (5)

Here, Z_i is a randomly drawn categorical nucleotide at the i-th position from the (softmax) probability distribution σ(l)_i. The nucleotide logits l_ij are thus used as parameters to N categorical distributions, from which we sample nucleotides {Z_i}_(i=1)^N and construct a discrete, one-hot coded pattern δ(l) ∈ {0, 1}^(N×M). While δ(l) is not directly differentiable, l can be updated based on the estimate of ∇_l P(δ(l)) using straight-through (ST) approximation. Rather than using the original ST estimator of Bengio et al. (2013), we here adopt an estimator with theoretically better properties from Chung et al. (2016), where the gradient of δ(l)_ij is replaced by that of the softmax σ(l)_ij:

∂δ(l)_ij / ∂l_ik := ∂σ(l)_ij / ∂l_ik    (6)

Using discrete samples as input to P removes any pathologies arising from continuous input relaxations.
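In code, the sampler δ and the softmax ST backward pass (Equations 5 and 6) might look as follows. This is a NumPy sketch of the estimator described above; the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(l):
    e = np.exp(l - l.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sample_one_hot(l):
    """delta(l): draw Z_i ~ Categorical(sigma(l)_i) at each position i and
    return the discrete one-hot pattern of shape (N, M)."""
    p = softmax(l)
    idx = np.array([rng.choice(l.shape[1], p=p[i]) for i in range(l.shape[0])])
    return np.eye(l.shape[1])[idx]

def st_backward(l, d_loss_d_delta):
    """Softmax ST estimator: pretend d delta / d l equals the softmax Jacobian
    and chain it with the upstream gradient d loss / d delta."""
    s = softmax(l)
    jac = s[:, :, None] * (np.eye(l.shape[1])[None] - s[:, None, :])  # (N, M, M)
    return np.einsum('ij,ijk->ik', d_loss_d_delta, jac)
```

The forward pass feeds the discrete `sample_one_hot(l)` to the predictor, so the predictor never sees a continuous relaxation; only the backward pass uses the softmax surrogate.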
However, as we show below, convergence remains almost as slow as with the softmax method. Switching to the original ST estimator (∂δ(l)_ij / ∂l_ij = 1) speeds up convergence, but results in worse fitness optima (see Supplementary Information, Figure S1G for a comparison).

Fast Stochastic Sequence Backpropagation
Inspired by instance normalization in image GANs (Ulyanov et al., 2016), we hypothesized that the main bottleneck in earlier design methods, both in terms of optimization speed and optima found, stems from overly large and disproportionately scaled nucleotide logits. Here, we mitigate this problem by normalizing the logits across positions, i.e. we insert a normalization layer between the trainable logits l_ij and the sampling layer δ(l)_ij (Figure 1A).
For DNA sequence design, where the number of one-hot channels M is small (M = 4), we use a normalization scheme commonly referred to as instance-normalization. In this scheme, the nucleotide logits of each channel are normalized independently across positions. Let μ_j = (1/N) Σ_(i=1)^N l_ij and s_j = sqrt((1/N) Σ_(i=1)^N (l_ij − μ_j)^2) be the sample mean and standard deviation of the logits for nucleotide j across all positions i. For each step of gradient ascent, we compute the normalized logits l^(norm)_ij as:

l^(norm)_ij = (l_ij − μ_j) / s_j    (7)

Since logits with zero mean and unit variance have limited expressiveness when used as parameters to a probability distribution, we associate each channel j with a global scaling parameter γ_j and offset β_j. Having an independent offset β_j per channel is particularly well-suited for DNA, as nucleotides are often associated with a global preferential bias. The scaled, re-centered logits are calculated as:

l^(scaled)_ij = γ_j · l^(norm)_ij + β_j    (8)

For protein sequence design, the number of one-hot channels M is considerably larger (M = 20), resulting in fewer samples per channel and noisier normalization statistics. Here we found that layer-normalization was more stable: we compute a single global mean μ = (1/(N·M)) Σ_i Σ_j l_ij and standard deviation s across all positions and channels, normalize all logits jointly as l^(norm)_ij = (l_ij − μ) / s, and use a single scale γ and offset β.

Given the normalized and scaled logits l^(scaled) as parameters for the nucleotide sampler δ defined in Equation 5, we maximize P(δ(l^(scaled))) with respect to l_ij, γ_j and β_j (or γ and β in the context of proteins). Using the softmax ST estimator from Equation 6, we arrive at the following gradients:

∂P / ∂γ_j = Σ_(i=1)^N Σ_(k=1)^M [∂P(δ(l^(scaled))) / ∂δ(l^(scaled))_ik] · [∂σ(l^(scaled))_ik / ∂l^(scaled)_ij] · l^(norm)_ij    (10)

∂P / ∂β_j = Σ_(i=1)^N Σ_(k=1)^M [∂P(δ(l^(scaled))) / ∂δ(l^(scaled))_ik] · [∂σ(l^(scaled))_ik / ∂l^(scaled)_ij]    (11)

The normalization removes logit drift by keeping the values proportionally scaled and centered at zero (E[l^(norm)] = 0, Var[l^(norm)] = 1), enabling the gradients to swap nucleotides with few updates. Furthermore, the scaling parameter γ_j adaptively adjusts the sampling entropy to control global versus local optimization. This can be deduced from the gradient components of γ_j in Equation 10:

1. ∂P(δ(l^(scaled))) / ∂δ(l^(scaled))_ik is positive for nucleotides which increase fitness and negative otherwise.

2. ∂σ(l^(scaled))_ik / ∂l^(scaled)_ij is positive when j = k and negative otherwise.

3. l^(norm)_ij is positive only when we are likely to sample the corresponding nucleotide.

Here, the product of the first two terms is positive if j = k and nucleotide j raises fitness, or if j ≠ k and nucleotide k lowers fitness. Put together, the gradient for γ_j increases when our confidence l^(norm)_ij in nucleotide j is consistent with its impact on fitness, i.e. when the two factors share the same sign. Conversely, inconsistent nucleotides decrement the gradient. At the start of optimization, γ_j is small, leading to high PWM entropy and large jumps in sequence design space. As we sample consistent nucleotides and the entropy gradient turns positive, γ_j increases. Larger γ_j lowers the entropy and leads to more localized optimization. However, if we sample sufficiently many inconsistent nucleotides, the gradient of γ_j may turn negative, again raising entropy and promoting global exploration.
Note that, in the context of protein design where we have a single scale γ and offset β, the gradient expressions from Equations 10 and 11 are additively pooled across all M channels. The argued benefits of instance-normalization above thus hold true for layer-normalization as well.
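The normalization step (Equations 7-8) amounts to standardizing each logit channel across positions and re-introducing a learned scale and offset. A NumPy sketch, where the function name and the epsilon constant are our choices:

```python
import numpy as np

def instance_norm_logits(l, gamma, beta, eps=1e-6):
    """Normalize each channel j across positions i (Eq. 7), then rescale (Eq. 8):
    l_norm_ij = (l_ij - mu_j) / s_j;  l_scaled_ij = gamma_j * l_norm_ij + beta_j."""
    mu = l.mean(axis=0, keepdims=True)                          # per-channel mean over positions
    sd = np.sqrt(((l - mu) ** 2).mean(axis=0, keepdims=True) + eps)
    l_norm = (l - mu) / sd
    return gamma * l_norm + beta

# Skewed logits (channel 0 dominates) are rescaled to comparable magnitudes.
l = np.array([[5.0, 1.0, 0.0, 0.0],
              [6.0, 0.0, 1.0, 1.0],
              [7.0, 2.0, 2.0, 0.0]])
scaled = instance_norm_logits(l, gamma=np.ones(4), beta=np.zeros(4))
```

With γ = 1 and β = 0 the output channels all have zero mean and unit variance regardless of how skewed the raw logits were, which is exactly the drift-removal property argued above; γ and β then restore expressiveness as trainable parameters.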

Maximizing Nucleic Acid Sequence-Predictive Neural Networks
We first evaluate our method on the task of maximizing the classification or regression scores of five DNA- or RNA-level deep neural network predictors: (1) DragoNN, a model trained on ChIP-seq data to predict Transcription Factor (TF) binding (in this case binding of SPI1), (2) DeepSEA (Zhou et al., 2015), which predicts multiple TF binding probabilities and chromatin modifications (we use it here to maximize the probability of CTCF binding in the cell type Dnd41), (3) APARENT (Bogard et al., 2019), which predicts alternative polyadenylation isoform abundance given an input polyadenylation signal, (4) MPRA-DragoNN (Movva et al., 2019), a neural network trained to predict the transcriptional activity of short enhancer sequences and, finally, (5) Optimus 5' (Sample et al., 2019), which predicts the ribosomal load (translational efficiency) of 5' UTR sequences.
We compare our new logit-normalized, straight-through sampled sequence design method (Fast SeqProp) to the previous versions of the algorithm, namely the original method with continuous softmax-relaxed inputs (Killoran et al., 2017; here referred to as PWM) and SeqProp, the categorical sampling method described in Bogard et al. (2019) using the (non-normalized) softmax straight-through gradient estimator. We also tested a logit-normalized version of the softmax-relaxed method, Fast PWM, in order to disentangle the individual performance contributions of the normalization scheme and the sampling scheme with respect to the baseline PWM method.
Figure 1B shows the result of using the design methods to generate maximally scored sequences for each of the five DNA-based predictors. Fast SeqProp converges to 95%-99% of its minimum test loss within 2,000 logit updates, and reaches 50% of the minimum loss after only 200 updates for all predictors except MPRA-DragoNN and Optimus 5'. In contrast, PWM and SeqProp do not converge within 20,000 updates. Fast SeqProp converges to up to 3-fold better optima than all other compared methods. In fact, Fast SeqProp reaches the same or better optima in 200 updates than the competing methods reach in 20,000 updates for DragoNN, MPRA-DragoNN and DeepSEA, marking a 100x speedup. For Optimus 5' and APARENT, the speedup is 20x-50x. In addition to gradient-based methods, we demonstrate improved performance compared to discrete search algorithms such as Simulated Annealing (see Supplementary Information, Figure S1A-B).
In the Supplementary Information, we provide additional technical comparisons of Fast SeqProp to previous activation maximization methods. For example, in Figure S1C, we demonstrate that certain sequence-predictive neural networks suffer from out-of-distribution (OOD) pathologies on continuous sequence relaxations as input, explaining the poor performance of the PWM design method. We further show that adding an entropy penalty to the PWM method still cannot close the performance gap to Fast SeqProp (Figure S1D) and that the Softmax ST estimator is superior to Gumbel sampling (Figure S1E). Finally, we show that Fast SeqProp appears robust to the choice of optimizer parameters (Figure S1F) and that the Softmax ST estimator outperforms the original ST estimator (Figure S1G).

Recapitulating cis-Regulatory Biology with Activation Maximization
In Figure 2 we compare example sequence optimizations of the PWM and Fast SeqProp methods. As can be seen, even after 20,000 updates, the PWM method has not converged for most of the tested predictors. In contrast, we find plenty of cis-regulatory motifs in the converged sequences generated by Fast SeqProp. Since our method was tasked with maximizing the predicted score of each model, we would expect to find enhancing motifs and regulatory logic embedded in the sequences which give rise to these extreme model responses. For example, when maximizing DragoNN, Fast SeqProp generates dual SPI1 binding motifs (Sandelin et al., 2004). For APARENT, Fast SeqProp generates multiple CFIm binding motifs, dual CSE hexamers, and multiple cut sites with CstF binding sites. These are all regulatory motifs known to enhance cleavage and polyadenylation by stimulating recruitment of the polyadenylation machinery (Di Giammartino et al., 2011; Shi et al., 2012; Elkon et al., 2013; Tian & Manley, 2017). For DeepSEA, Fast SeqProp generates four CTCF binding sites. For MPRA-DragoNN, we identify both CRE and CTCF binding sites embedded within a GC-rich context, which aligns well with what we might expect to find in a strong enhancer (Kheradpour et al., 2014; Ernst et al., 2016). Finally, for Optimus 5', Fast SeqProp generates a T-rich sequence with multiple in-frame (IF) uAUGs; these determinants were previously found to improve ribosome loading (Sample et al., 2019). See the Supplementary Information (Figure S2) for additional visualizations comparing the PWM and Fast SeqProp methods during different stages of optimization.

Regularized Sequence Design
While finding regulatory logic in the sequences produced by activation maximization is a good indication that we actually generate patterns with biological meaning, the method may still not be suitable in its most basic form for sequence design. There is the potential issue of overfitting to the predictor oracle during sequence optimization, as the oracle may lose its accuracy when drifting out of the training data distribution to maximize predicted fitness. By training a differentiable likelihood model, such as a variational autoencoder (VAE; Kingma et al., 2013), on samples from the same data and using it as a regularizer in the cost function, we can prevent drift to low-confidence regions of design space (Figure 3A; top). Using a VAE to avoid drift has previously been demonstrated by Gómez-Bombarelli et al. (2018), Brookes et al. (2019) and Linder et al. (2020). In summary, we extend the original optimization objective (Eq. 1) by passing the sampled one-hot pattern δ(l) to the VAE and penalizing the pattern based on its VAE-estimated marginal likelihood, p_VAE(δ(l)), using ST approximation for backpropagation (see Equation 13 in Methods).
The degree to which predictors exhibit pathological behavior when maximized seems to vary on a case-by-case basis and likely depends heavily on the data distribution. When designing maximally strong gene enhancers using the MPRA-DragoNN predictor, for example, VAE-regularization has a clear effect on shifting the distribution of the designed sequences (Figure 3A; bottom histograms). In contrast, when designing polyadenylation signals, VAE-regularization has no effect, since non-regularized optimization already generates sequences that are at least as likely as the training data according to the VAE (see Supplementary Information, Figure S3A). Next, we tasked the VAE-regularized Fast SeqProp method with designing maximally strong polyadenylation signals (using APARENT as the oracle), maximally transcriptionally active enhancer sequences (using MPRA-DragoNN as the oracle) and maximally translationally efficient 5' UTRs (using Optimus 5'). For each task, we trained a β-VAE and a W-GAN on a sample of 5,000 high-fitness sequences (see Methods for details). We then used the methods CbAS (Brookes et al., 2019), FB-GAN (Gupta et al., 2019), AM-VAE (Killoran et al., 2017), RWR (Peters et al., 2007) and FB-VAE (a VAE-version of FB-GAN) to maximize each oracle, using the VAE or GAN we trained earlier and default method parameters. We used the same VAE as the regularizer for our design method (Fast SeqProp). During optimization, we measured the fitness scores of both the oracle and a number of independent validation models that we did not directly optimize for, allowing us to estimate sequence fitness in an unbiased way. Specifically, when designing polyadenylation signals based on APARENT, we validated the designs using DeeReCT-APA (Li et al., 2020), an LSTM trained on 3'-sequencing data of mouse cells, and DeepPASTA (Arefeen et al., 2019), a CNN trained on 3'-sequencing data of multiple human tissues (we used the model trained on brain tissue). When designing enhancer sequences, we validated
the designs using iEnhancer-ECNN (Nguyen et al., 2019), an ensemble of CNNs trained on genomic enhancer sequences, and EnhancerP-2L (Butt et al., 2020), a Random Forest classifier based on statistical features extracted from enhancer regions in the genome. Finally, to validate the Optimus 5' designs, we had access to a newer version of the model that had been trained on additional MPRA data, making it more robust, particularly on outlier sequences such as long homopolymer stretches (Sample et al., 2019). On a practical note, we found it quite difficult to train a VAE on the APARENT, Optimus 5' and MPRA-DragoNN datasets, and the convergence of CbAS, RWR and FB-GAN appeared sensitive to quantile threshold settings, which we believe stems from the considerable heterogeneity and variability of the data.
The results (Figure 3B) show that Fast SeqProp reaches significantly higher oracle fitness scores and validation model scores, with orders of magnitude fewer calls to the oracle, for all tasks except the 5' UTR design problem, where AM-VAE instead reaches high validation scores faster. The other methods either do not reach the same median validation score in the total allotted time, or do so at the expense of reduced diversity (see Supplementary Information, Figure S3B). For the polyadenylation signal design task, Fast SeqProp reaches identical DeeReCT-APA (Figure 3B; top right) and DeepPASTA (see Supplementary Information, Figure S3C) validation scores with or without VAE-regularization. The designed polyadenylation signal sequences look quite similar and share many motifs, such as multiple CFIm, CstF and CPSF binding sites (Figure 3C; top). For the enhancer design task, the VAE-regularization is clearly beneficial according to the validation model; while enhancers designed by Fast SeqProp without the VAE have a median MPRA-DragoNN score of 3.5, the median iEnhancer-ECNN score (Figure 3B; middle right) is just 0.43. With VAE-regularization, we generate sequences with a lower median MPRA-DragoNN score (3.25), but a higher iEnhancer-ECNN score (0.55). However, closer inspection reveals that Fast SeqProp does not consistently generate worse enhancers according to the validation model than its VAE-regularized counterpart. Rather, Fast SeqProp without the VAE generates enhancers that are either highly or lowly scored by the validation model, while Fast SeqProp with the VAE consistently generates medium-scored enhancers (example sequences are shown in Figure 3C; middle). This dynamic is also observed with another validation model (EnhancerP-2L; see Supplementary Information, Figure S3D); only 80% of Fast SeqProp (no VAE) sequences are identified by EnhancerP-2L as enhancers, while nearly 100% of Fast SeqProp-VAE sequences are identified. However, their weighted predicted enhancer
strengths are identical. It is also worth noting that most other methods decrease their validation scores as they increase their MPRA-DragoNN scores; this is because they get stuck in a suboptimal local minimum with pathological AT-repeats. Finally, VAE-regularization is beneficial for designing 5' UTRs, as it restricts the sequences from becoming overly T-rich, a sequence pathology present in the original Optimus 5' model which the retrained version correctly recognizes as decreasing ribosome load (Figure 3B; bottom; Figure 3C; bottom).
In the Supplementary Information, we provide extra benchmark experiments comparing Fast SeqProp to a subset of the above design methods. In particular, in Figure S3E, we train the same kind of oracles as were used by Brookes et al. (2019) to estimate uncertainty in the fitness predictions (Lakshminarayanan et al., 2017), and use these models to replicate the polyadenylation signal and 5' UTR design benchmarks. We also replicate the GFP design task used in Brookes et al. (2019). Additionally, in Figure S3F, we include an example where we use MPRA-DragoNN to design enhancers that are maximally specific to the cell line HepG2 (and inactivated in K562), and show how internal network penalties can be used to regularize the sequence optimization when it is hard to train a sufficiently accurate uncertainty-estimating oracle.

Protein Structure Optimization

Multiple deep learning models have recently been developed for predicting tertiary protein structure (AlQuraishi, 2019; Senior et al., 2020; Yang et al., 2020). Here, we demonstrate our method by designing de novo protein sequences which conform to a target residue contact map as predicted by trRosetta (Yang et al., 2020). The predictor takes three inputs (Figure 4A): a one-hot coded sequence, a PSSM constructed from a multiple-sequence alignment (MSA) and a direct-coupling analysis (DCA) map. For our design task, we pass the optimizable one-hot pattern to the first two inputs and an all-zeros tensor as the DCA feature map. Given the predicted distance distribution D_P ∈ [0, 1]^(N×N×37) and angle distributions θ_P, ω_P ∈ [0, 1]^(N×N×24) and φ_P ∈ [0, 1]^(N×N×12), we minimize the mean KL-divergence against target distributions D_T, θ_T, ω_T and φ_T:

L_KL = (1/4) [ KL(D_T ‖ D_P) + KL(θ_T ‖ θ_P) + KL(ω_T ‖ ω_P) + KL(φ_T ‖ φ_P) ]

where each KL term is averaged over all residue pairs (i, j).
We compared SeqProp and Simulated Annealing to a modified version of Fast SeqProp in which logits are normalized across all residue channels (layer-normalized rather than instance-normalized) to reduce the increased variance of 20 one-hot channels. We used the methods to design protein sequences conforming to the target structure of an example protein (a Sensor Histidine Kinase). We optimized 5 independent sequences per design method and recorded the median KL-loss at each iteration. The results show that Fast SeqProp converges faster and reaches better minima (Figure 4B; see also Supplementary Information, Figure S4A); after 200 iterations, Fast SeqProp reached a 4x lower KL-divergence than all other methods, and much of the target structure is visible (Figure 4C). While the choice of learning rate changes the rate of convergence, it does not alter the minima found by Fast SeqProp (Figure 4B). Additionally, by sampling multiple sequences at once and walking down the average gradient (e.g. 10 samples per update), we can improve the rate of convergence further by making the gradient less noisy (see Supplementary Information, Figure S4B). Importantly, this scales significantly better than linearly in execution time, since multiple samples can be differentiated in parallel on a GPU. Finally, we replicated our results by designing sequences for a different protein structure (an alpha-helical hairpin; see Supplementary Information, Figure S4C-E).
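The KL objective used for structure-guided design can be sketched as follows. This is a simplified stand-in: `pred` and `target` are dictionaries of distance/angle probability tensors, the head names are our labels, and the bin counts follow the shapes quoted in the text.

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL(p || q) along the last (bin) axis of two probability tensors."""
    p, q = p + eps, q + eps
    return (p * np.log(p / q)).sum(axis=-1)

def structure_kl_loss(pred, target):
    """Mean KL divergence between target and predicted distance/angle
    distributions, averaged over residue pairs and the four output heads."""
    heads = ('dist', 'theta', 'omega', 'phi')
    return float(np.mean([kl(target[h], pred[h]).mean() for h in heads]))

# Toy example: uniform predictions for a length-4 design.
N = 4
uniform = {h: np.full((N, N, b), 1.0 / b)
           for h, b in [('dist', 37), ('theta', 24), ('omega', 24), ('phi', 12)]}
```

Minimizing this loss with the straight-through sampler pushes the sampled sequence's predicted distributions toward the target structure; a loss of zero means the predicted and target distributions match exactly at every residue pair.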

Discussion
By normalizing nucleotide logits across positions and using a global entropy parameter, Fast SeqProp keeps logits proportionally scaled and centered at zero. The gradient of the entropy parameter γ in our design method adaptively adjusts the sampling temperature to trade off global and local optimization.
In the beginning, γ is small, corresponding to a high PWM entropy and consequently very diverse sequence samples. As optimization progresses, γ grows, leading to more localized sequence changes. This adaptive mechanism, in combination with flexible nucleotide logits due to the normalization, results in a highly efficient design method. As demonstrated on five deep learning predictors, logit normalization enables extremely fast sequence optimization, with a 50- to 100-fold speedup compared to state-of-the-art methods for many predictors, and with better predicted optima. Highlighting its broad usefulness, since this paper was first deposited on a pre-print server, several genomic and protein design papers based on variations of normalized and discretized activation maximization have been made available (Schreiber et al., 2020; Norn et al., 2020; Tischer et al., 2020).
In addition to logit drift and vanishing gradients, the original sequence ascent method suffers from predictor pathologies caused by passing continuous softmax sequence relaxations as input, a problem fully removed by using discrete sampling. We further observed that straight-through sampling leads to consistently better optima than softmax relaxation, suggesting that it is better at traversing local minima. In fact, our method outperformed global optimization meta-heuristics such as Simulated Annealing on more difficult design tasks, such as designing 1000 nt long enhancer regions or designing protein sequences which conform to a complex target structure. We further demonstrated robust sequence design results on a number of tasks, including the design of strong alternative polyadenylation signals, efficiently translated 5' UTRs and enhancers that result in high transcriptional activity. We did so by incorporating a regularization penalty based on variational autoencoders, and showed better and faster convergence than other regularized design methods.

Conclusion
We presented an improved version of gradient ascent (or activation maximization) for biological sequence design, called Fast SeqProp, which combines logit normalization with stochastic nucleotide sampling and straight-through gradients. We demonstrated the efficacy of the method on several DNA-, RNA- and protein design tasks. We expect this algorithmic improvement to be broadly useful to the research community for biomolecular optimization at the level of primary sequence. The approach introduced here could accelerate the design of functional enzymes and other biomolecules, potentially resulting in novel drug therapies, molecular sensors and more.

Methods

Activation Maximization Design Methods
In Figure 1 and throughout the paper, we compare four different activation maximization methods for sequences: (1) Fast SeqProp (our method), the modified activation maximization method which combines the logit normalization scheme of Equations 7-8 with the softmax straight-through estimator of Equations 5-6; (2) PWM, the original method with continuous softmax-relaxed inputs (Killoran et al., 2017); (3) SeqProp, the categorical sampling method described in Bogard et al. (2019) using the (non-normalized) softmax straight-through gradient estimator; and (4) Fast PWM, a logit-normalized version of the softmax-relaxed method.
Starting with a randomly initialized logit matrix l, for the methods PWM and Fast PWM we optimize l using the softmax relaxation σ(l) from Equation 3. For SeqProp and Fast SeqProp, we optimize l using the discrete nucleotide sampler δ(l) from Equation 5. We define the optimization loss (or the 'train' loss) as:

L_train(l) = −P(x(l))

For PWM and Fast PWM, x(l) = σ(l). For SeqProp and Fast SeqProp, x(l) = δ(l).
The actual loss (or 'test' loss) is evaluated on the basis of discrete sequence samples drawn from the optimized softmax representation σ(l), regardless of design method. In all four methods, we can use the categorical nucleotide sampler δ(l) to draw sequence samples and compute the mean test loss as:

L_test^(t) = −(1 / (K·S)) Σ_(k=1)^K Σ_(s=1)^S P(δ^(s)(l^(k, t)))

Here, S refers to the number of samples drawn from each softmax sequence σ(l^(k)) at every weight update t, and K is the number of independent optimization runs. In all experiments, we set K = 10 and S = 10.
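As a concrete illustration, the test-loss estimate can be computed by repeated discrete sampling. This is a sketch with a toy stand-in for the oracle P (the real predictors are neural networks), and the helper names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_test_loss(logit_sets, predictor, S=10):
    """Draw S discrete samples from each of the K optimized softmax
    representations sigma(l^(k)) and average -P over all K * S samples."""
    losses = []
    for l in logit_sets:                                   # K optimization runs
        p = np.exp(l - l.max(axis=-1, keepdims=True))
        p /= p.sum(axis=-1, keepdims=True)
        N, M = l.shape
        for _ in range(S):                                 # S samples per run
            idx = [rng.choice(M, p=p[i]) for i in range(N)]
            losses.append(-predictor(np.eye(M)[idx]))
    return float(np.mean(losses))

# Toy oracle: fraction of channel-0 ("A") positions in the one-hot sequence.
toy_P = lambda x: x[:, 0].mean()
```

Because the samples are discrete, this estimate reflects what the predictor would actually score at deployment, rather than the score of a continuous relaxation.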
In addition to gradient-based methods, we compare Fast SeqProp to discrete search algorithms.
The first method is a pairwise nucleotide-swapping search (Evolution; Sample et al., 2019), where sequence x is mutated with either 1 or, with a 50% chance, 2 random substitutions at each iteration, resulting in a new candidate sequence x′. x′ is only accepted if P(x′) > P(x). We also tested a well-known meta-heuristic, Simulated Annealing (Kirkpatrick et al., 1983). In Simulated Annealing, mutations are accepted even if they result in lower fitness, with probability P(x′, x, T), where T is a temperature parameter. Here we use the Metropolis acceptance criterion (Metropolis et al., 1953):

P(x′, x, T) = min(1, exp((P(x′) − P(x)) / T))
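A minimal version of this baseline, with the Metropolis rule and a geometric cooling schedule, can be sketched as follows. The cooling schedule, step count and toy fitness function are our assumptions, not the paper's exact settings:

```python
import math, random

def metropolis_accept(fit_new, fit_old, T, rng):
    """Metropolis criterion for maximization: always accept improvements,
    accept a worse candidate with probability exp((P(x') - P(x)) / T)."""
    if fit_new >= fit_old:
        return True
    return rng.random() < math.exp((fit_new - fit_old) / T)

def anneal(x, P, mutate, steps=2000, T0=1.0, decay=0.99, seed=0):
    """Minimal simulated-annealing loop over discrete sequences."""
    rng = random.Random(seed)
    T = T0
    for _ in range(steps):
        x_new = mutate(x, rng)
        if metropolis_accept(P(x_new), P(x), T, rng):
            x = x_new
        T *= decay  # geometric cooling (an assumed schedule)
    return x

# Toy task: maximize the number of A's in a DNA string via single substitutions.
def mutate(x, rng):
    i = rng.randrange(len(x))
    return x[:i] + rng.choice('ACGT') + x[i + 1:]
```

Early on, the high temperature lets the search accept fitness-decreasing mutations and explore broadly; as T decays the loop degenerates into hill climbing, mirroring the global-to-local transition that Fast SeqProp achieves through its entropy parameter γ.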

VAE-regularized Fast SeqProp
In the main paper (Figure 3), we use a variational autoencoder (VAE; Kingma et al., 2013) to regularize the sequence design when running Fast SeqProp. The original optimization objective (Eq. 1) is extended by passing the sampled one-hot pattern δ(l) to the VAE and estimating its marginal likelihood, p_VAE(δ(l)), using importance-weighted inference. We then minimize a margin loss with respect to the mean likelihood p_ref of the original training data to keep sequence designs in-distribution, using the Softmax ST estimator to propagate gradients back to l:

L(l) = −P(δ(l)) + λ · max(0, log p_ref − log p_VAE(δ(l)))    (13)
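The resulting margin penalty has a simple shape: designs at least as likely as the reference incur no cost, while less likely designs are penalized linearly in log-likelihood. A pure-Python sketch, where the weight argument plays the role of a hypothetical trade-off coefficient:

```python
def vae_margin_penalty(log_p_vae, log_p_ref, weight=1.0):
    """Hinge loss on the VAE log-likelihood of a design: zero while the design
    is at least as likely as typical training data, growing as it drifts below."""
    return weight * max(0.0, log_p_ref - log_p_vae)

def regularized_loss(fitness, log_p_vae, log_p_ref, weight=1.0):
    """Total design loss: maximize predicted fitness while staying in-distribution."""
    return -fitness + vae_margin_penalty(log_p_vae, log_p_ref, weight)
```

The hinge form means the regularizer is inactive inside the training distribution, so it only constrains the optimizer once designs begin to drift toward low-confidence regions.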

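The margin loss and the importance-weighted likelihood estimate can be sketched as follows (a simplified sketch; the exact margin form and names are illustrative, and in the actual method these quantities are computed inside the deep learning framework so that gradients can flow through the Softmax ST estimator):

```python
import numpy as np

def vae_margin_loss(log_p_vae, log_p_ref, margin=0.0):
    # Penalize a design only when its estimated VAE log-likelihood
    # falls more than `margin` below the training-data reference.
    return max(0.0, (log_p_ref - margin) - log_p_vae)

def importance_weighted_ll(log_ws):
    # log p(x) ~= log mean_i exp(log w_i) over importance samples,
    # computed stably via the log-sum-exp trick.
    log_ws = np.asarray(log_ws, dtype=float)
    m = log_ws.max()
    return float(m + np.log(np.mean(np.exp(log_ws - m))))
```

The hinge form means in-distribution designs (likelihood at or above p_ref) incur no penalty, so the regularizer only activates when optimization drifts off-manifold.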
VAE-regularized Fast SeqProp with Uncertainty-estimation
In the Supplementary Information (Figure S3E), we replicate the benchmark comparison of the main paper (Figure 3), but use oracle predictors capable of estimating the uncertainty in their fitness predictions to further regularize the designs (Lakshminarayanan et al., 2017). Assume that the oracle model predicts the mean µ[δ(l)] and standard deviation σ[δ(l)] of fitness scores for the designed (sampled) pattern δ(l). We then use the (differentiable) survival function of the normal distribution to maximize the probability p_{µ[δ(l)], σ[δ(l)]}(Y > q) = 1 − Φ((q − µ[δ(l)]) / σ[δ(l)]) that the predicted fitness of sequence δ(l) is larger than quantile q of the training data, where Φ is the standard normal CDF.

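The survival-function objective can be sketched as follows (here expressed via the complementary error function; in the actual method the computation is written in the deep learning framework so gradients can flow back to l):

```python
import math

def normal_survival(q, mu, sigma):
    # P(Y > q) for Y ~ N(mu, sigma^2), via the complementary error function:
    # 1 - Phi((q - mu) / sigma) = 0.5 * erfc((q - mu) / (sigma * sqrt(2))).
    return 0.5 * math.erfc((q - mu) / (sigma * math.sqrt(2.0)))

def uncertainty_objective(mu, sigma, q):
    # Loss to minimize: negative log-probability that the predicted
    # fitness exceeds quantile q of the training data.
    return -math.log(normal_survival(q, mu, sigma))
```

Because the objective rewards a high exceedance probability rather than a high point estimate, designs with large predictive uncertainty are implicitly penalized.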
VAE-regularized Fast SeqProp with Activity-regularization
In the Supplementary Information (Figure S3F), we use the predictor MPRA-DragoNN to design maximally HepG2-specific enhancer sequences, and use activity regularization on (some of) the internal layers of the predictor to regularize the optimization. We maximize the predicted fitness score P(δ(l)) (and minimize the VAE loss as before) while also minimizing a margin loss applied to the sum of a subset of convolutional activation maps C_k(δ(l)).

Validation Models

DeepPASTA: (Arefeen et al., 2019) Predicts the relative use of a polyadenylation signal when competing with a strong distal signal. The trained model was downloaded from: https://www.cs.ucr.edu/~aaref001/DeepPASTA_site.html.
iEnhancer-ECNN: (Nguyen et al., 2019) Detects genomic enhancer regions and predicts whether a detected enhancer is weak or strong. We used the product of these two probability outputs to score each designed enhancer sequence. The model was trained using the code repository at: https://github.com/ngphubinh/enhancers.
EnhancerP-2L: (Butt et al., 2020) Detects genomic enhancer regions and predicts whether a detected enhancer is weak or strong. For a sample of generated sequences per design method, we calculated the mean detect/not-detect prediction rate, the mean weak/strong prediction rate and the mean P-score.
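The activity-regularized objective described under 'VAE-regularized Fast SeqProp with Activity-regularization' can be sketched as follows (a minimal sketch; the weights, threshold and choice of activation maps are illustrative):

```python
import numpy as np

def activity_penalty(activation_maps, threshold):
    # Margin loss on the summed activations of selected convolutional
    # layers: only activation mass above `threshold` is penalized.
    total = sum(float(np.sum(a)) for a in activation_maps)
    return max(0.0, total - threshold)

def regularized_loss(fitness, vae_loss, activation_maps, threshold,
                     w_vae=1.0, w_act=1.0):
    # Maximize predicted fitness (minimize its negative) while keeping
    # both the VAE margin loss and the activity penalty small.
    return (-fitness
            + w_vae * vae_loss
            + w_act * activity_penalty(activation_maps, threshold))
```

Capping total internal activation discourages the optimizer from exploiting pathologically strong feature responses in the predictor while still allowing high fitness scores.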

Auxiliary Models
In Figure 3, we trained a variational autoencoder (VAE) and a generative adversarial network (GAN) on a subset of the data that was originally used to train each of the predictor oracles APARENT, MPRA-DragoNN and Optimus 5'. For each design task, we selected a sample of 5,000 sequences with the highest observed fitness and a sample of 5,000 randomly selected sequences. The VAE, which was based on a residual network architecture (He et al., 2016), was trained on the high-fitness subset of sequences. The GAN, a Wasserstein GAN (W-GAN) based on the architecture of Gupta et al. (2019), was trained on the random subset of sequences.

Other Design Methods
A selection of design methods were used for benchmark comparisons in Figure 3. Here we describe how they were executed and what parameter settings were used.

CbAS: (Brookes et al., 2019) The procedure was started from the VAE which had been pre-trained on the high-fitness dataset. It was executed for 150 rounds and, depending on design task, either 100 or 1,000 sequences were sampled and used for weighted re-training at the end of each round (whichever resulted in higher fitness scores). The threshold was set to either the 60th or 80th percentile of fitness scores predicted on the training data (whichever resulted in more stable fitness score trajectories).
The VAE was trained for either 1 or 10 epochs at the end of each round (whichever resulted in more stable fitness scores; for some tasks, the fitness scores would drop abruptly after only a few sampling rounds when training the VAE for 10 epochs per round). For the benchmark comparison in the main paper, the standard deviation of the predictions was set to a small constant value between 0.02 and 0.1, depending on application (since none of the pre-trained oracles APARENT, MPRA-DragoNN or Optimus 5' predicts a deviation, we used a small constant deviation that was ∼50x smaller than the maximum possible predicted value). In the Supplementary Information, where we use oracles with uncertainty estimation, we also supplied the predicted standard deviation to the CbAS survival function. The code was adapted from: https://github.com/dhbrookes/CbAS/.
RWR: (Peters et al., 2007) The procedure was started from the VAE which had been pre-trained on the high-fitness dataset. It was executed for 150 rounds and 100 or 1,000 sequence samples were used for weighted re-training at the end of each round (whichever resulted in higher fitness scores). The VAE was trained for 10 epochs each round. The code was adapted from: https://github.com/dhbrookes/CbAS/.
AM-VAE: (Killoran et al., 2017) This method performs activation maximization by gradient ascent through a pre-trained VAE in order to design sequences. The procedure was started from the VAE which had been pre-trained on the high-fitness dataset. Each sequence was optimized for 2,000-5,000 updates depending on design task (using the Adam optimizer). A normally distributed noise term was added to the gradients to help overcome potential local minima. The code was adapted from: https://github.com/dhbrookes/CbAS/.
FB-GAN: (Gupta et al., 2019) The FB-GAN procedure was started from the W-GAN which had been pre-trained on a random sample of sequences. The method was executed for 150 epochs and 960 sequences were sampled and used for feedback at the end of each epoch. We either set the feedback threshold to a fixed value (the 80th percentile of fitness scores predicted on the high-fitness dataset), or we adaptively re-set the threshold to a certain percentile as measured on the 960 sampled sequences at the end of each epoch. The code was adapted from: https://github.com/av1659/fbgan.
FB-VAE: (Gupta et al., 2019) A VAE-based version of the FB-GAN. The procedure was started from the VAE which had been pre-trained on the high-fitness dataset. It was executed for 150 epochs and 100 or 1,000 sequence samples were used for feedback at the end of each epoch (whichever resulted in higher fitness scores). A fixed threshold was used (either the 60th or 80th percentile as predicted on the high-fitness data). The code was adapted from:
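The noisy gradient-ascent update used by AM-VAE above can be sketched as follows (a minimal sketch; the learning rate and noise scale are illustrative, and grad_fn stands in for backpropagating the predictor score through the VAE decoder):

```python
import numpy as np

def noisy_ascent_step(z, grad_fn, lr=0.1, noise_std=1e-2, rng=None):
    # One activation-maximization update in VAE latent space: ascend
    # the predictor gradient, with Gaussian noise added to the gradient
    # to help escape local optima.
    rng = rng or np.random.default_rng(0)
    g = grad_fn(z) + rng.normal(0.0, noise_std, size=z.shape)
    return z + lr * g
```

Repeating this step for 2,000-5,000 updates (as in the benchmark) and decoding the final latent vector yields the designed sequence.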

Figure 2: Example softmax sequences (PSSMs) generated by the PWM and Fast SeqProp methods after 20,000 gradient ascent updates with default optimizer parameters (Adam). The logit matrices l were uniformly randomly initialized prior to optimization. Identified cis-regulatory motifs are annotated above each sequence.

Figure 3: (A) Top: VAE-regularized Fast SeqProp. A variational autoencoder (VAE) is used to control the estimated likelihood of designed sequences during gradient ascent optimization. Bottom: Estimated VAE log-likelihood distribution of random sequences (green), test sequences from the MPRA-DragoNN dataset (orange) and designed sequences (red), using Fast SeqProp without and with VAE regularization (top and bottom histogram, respectively). (B) Oracle fitness score trajectories (APARENT, MPRA-DragoNN and Optimus 5') and validation model score trajectories (DeeReCT-APA, iEnhancer-2L and retrained Optimus 5') as a function of the cumulative number of predictor calls made during the sequence design phase. Shown are the median scores across 10 samples per design method, for three repeats. (C) Example designed sequences for APARENT, MPRA-DragoNN and Optimus 5', using Fast SeqProp with and without VAE regularization. Oracle and validation model scores are annotated on the right.

Figure 4: (A) Protein sequences are designed to minimize the KL-divergence between predicted and target distance and angle distributions. The one-hot pattern is used for two of the trRosetta inputs. (B) Generating sequences which conform to the target predicted structure of a Sensor Histidine Kinase. Simulated Annealing was tested at several initial temperatures, with 1 substitution per step. Similarly, SeqProp and Fast SeqProp were tested at several combinations of learning rate and momentum. (C) Predicted residue distance distributions after 200 iterations.

Figure S1: (A) In Simulated Annealing, mutations are accepted with a temperature-controlled probability even if the predicted fitness decreases. (B) Maximizing DragoNN SPI1. Simulated Annealing is tested at several parameter configurations (number of substitutions per step / initial temperature). (C) Maximizing MPRA-DragoNN using either (top) PWM or (bottom) Fast SeqProp as the design method. Shown are the optimization scores during the design phase (green) when we pass either the softmax sequence (PWM) or a sampled sequence (Fast SeqProp) to the predictor, and the corresponding scores if we were to sample sequences from the softmax representation (orange). (D) Maximizing Optimus 5' and DragoNN with a softmax sequence entropy penalty. (E) Comparing Fast SeqProp to a version of the same method using the Gumbel distribution ('Gumbel': Gumbel sampling; 'Gumbel-IN': Gumbel sampling with instance normalization). (F) Maximizing APARENT using Fast SeqProp or PWM with SGD (LR = learning rate). (G) Maximizing DragoNN using Fast SeqProp, with either the softmax or the original (simple) straight-through estimator. 1x and 10x refer to the number of sequences sampled at each update.

Figure S3: (A) Estimated VAE log-likelihood distribution of random sequences (green), test sequences from the APARENT dataset (orange) and designed sequences (red), using Fast SeqProp without (top) and with (bottom) VAE regularization. (B) Sequence diversity of each design method as a function of predictor calls, measured as the median relative edit distance between sequence samples per method, across three repeats. (C) Median predicted relative use of the designed polyadenylation signals for each method, using the DeepPASTA model. Median score reported across three repeats. (D) Three types of predicted validation scores using the EnhancerP-2L model: (left) the mean frequency of identified functional enhancers, (middle) the mean weighted frequency of identified enhancers (weak enhancers are worth 0.5, strong enhancers 1.0), and (right) predicted EnhancerP-2L P-scores. The means are computed across three repeats. (E) Additional benchmark analysis using both VAE regularization and a probabilistic oracle ensemble capable of estimating uncertainty (Lakshminarayanan et al., 2017; see Appendix C for details). Top: Median oracle fitness scores across three repeats for the tasks: pA signal design, 5' UTR design and GFP variant design. Bottom: Median validation scores across three repeats, using DeeReCT-APA, Optimus 5' (retrained) and a GP regression model, respectively. (F) Maximizing the difference in transcriptional activity between K562 and HepG2. Shown are the loss and an example designed sequence when running Fast SeqProp without regularization (orange) and when applying both VAE and activity regularization (green).