We posited that the CVAE (described in “Methods” section) encoding would result in a model that can automatically capture biophysically relevant features from the simulation datasets. We used three model protein folding systems, namely Fs-peptide, villin headpiece (VHP) and BBA to demonstrate that the CVAE can learn a biophysically relevant latent space that corresponds to folding reaction coordinates, including fraction of native contacts and root mean squared deviation (RMSD) to the native state. To calculate the fraction of native contacts we use a definition similar to Savol and Chennubhotla [33]. Native contacts are based on a distance cut off of 8 Å or less between between C ^{α} atoms and at least 75% of conformations remain within an RMSD cut-off of 1.1 Å of the native structure. First, we evaluate the ability of CVAE to learn a reduced dimensional space given the MD simulation data. Second, we show that the CVAE latent space corresponds to biophysically relevant features for each of the folding simulations studied. Finally, we demonstrate that the CVAE latent features can be transferred to other simulations, making it generalizable to a particular protein type.

### Reconstruction quality of CVAE on protein folding trajectories

In order to evaluate the CVAE reconstruction quality from the protein folding trajectories, we first examined the overall loss (\(\mathcal {L}\)) of the CVAE over the training epochs (Eq. 1).

$$\begin{array}{*{20}l} \mathcal{L} &= \mathcal{E}_{r} + \mathcal{E}_{l} \end{array} $$

(1)

$$\begin{array}{*{20}l} \mathcal{E}_{r} &= - \frac{1}{n} \sum_{i=1}^{n} X_{i} \text{log}(f(z_{i})) \end{array} $$

(2)

$$\begin{array}{*{20}l} \mathcal{E}_{l} &= \text{KL}(z||Normal(0,1)) \end{array} $$

(3)

\(\mathcal {L}\) is characterized as the sum of the reconstruction loss \(\mathcal {E}_{r}\) (Eq. 2) and the latent loss \(\mathcal {E}_{l}\) (Eq. 3). The reconstruction loss measures how well the CVAE can reconstruct the original input contact matrices (consisting of *n* conformations from the trajectory) and is calculated as the cross entropy loss between *f*(*z*), which indicates the reconstructed probability of contact between two C ^{α} atoms, and the original *X* contact maps from the simulation, which indicate the existence of contact between two C ^{α} atoms. The latent loss is a regularizing constraint that forces the latent embeddings *z* to conform to a Gaussian distribution; this is calculated as the Kullback-Leibler (KL) divergence between the latent embeddings *z* and a Normal distribution with mean 0 and standard deviation 1.

For the three protein folding trajectories in this study (Fig. 2) a–c, we observed that the overall loss, \(\mathcal {L}\), stabilizes over the training epochs, showing that it converges over time. We observed that for each protein, the number of training epochs needed to reach convergence is different; this is not surprising, given that the size of these proteins are different and the trajectories have unique folding pathways. Furthermore, we observed that the reconstruction loss (described in Eq. 1) is also different for each protein system – indicating that the quality of CVAE reconstruction is unique to each protein system.

We next examined whether the CVAE latent features can faithfully reconstruct the original data. To evaluate this, we used the reconstruction difference that simply measures the difference between the reconstructed and original contact matrices. Note that while the original contact matrix is typically binary (indicating presence or absence of a contact), the output from the CVAE is a value between 0 and 1, which may be interpreted as a likelihood of observing a contact between two C ^{α} atoms. We plotted a histogram of reconstruction from the testing data, shown in Fig. 2d. The reconstruction difference varies between -1 and +1, which indicates whether the reconstructed data mispredicts the presence or absence of a contact respectively. We choose a nominal threshold of 10% of the original value to indicate misprediction. For the Fs-peptide simulations, the CVAE is able to faithfully reconstruct nearly 88% of all the observed contacts and mispredicts only 12% of contacts (Fig. 2d). We note that these contacts are at the interfaces of secondary structural elements, between *α*-helices, or between *α*-helices and *β*-strands. We can make similar observations for the other protein systems; the average reconstruction error for VHP is about 10.6% (Fig. 2e). For BBA (Fig. 2f), the CVAE reconstruction can recover nearly 88.5% of all contacts correctly in the folding simulations.

We evaluated the performance of CVAE as a function of several model hyperparameters using Bayesian optimization [34–36]. The search bounds and optimal results for the hyperparameters are summarized in Table 1. While the optimal settings for the latent dimension for each molecule was found to be near ten, we chose to use models with latent dimensions of size three. Since it is possible to verify visually that the autoencoder is meaningfully capturing the folding process without sacrificing much in terms of reconstruction error, we used a three dimensional latent space for each of the protein systems. To meaningfully visualize the CVAE latent representation, we chose the t-distributed stochastic neighborhood embedding (t-SNE) [37] method. There are many choices for visualizing the latent space, including techniques such as mixture of Gaussians, k-means clustering – however, for this paper, t-SNE provided a practical way to visualize the CVAE latent space in a meaningful manner.

The mean reconstruction loss over various settings for the latent dimension for each of the protein folding trajectories can be seen in Fig. 3. When considering the latent dimension, there exists a trade-off between the model’s ability to compress information and its ability to minimize reconstruction error. For example, we note that the choice of the optimization technique used for the training process affects the model performance. To illustrate this, we examined four different optimizers: namely, RMSProp, ADAM, ADAMax, and ADAgrad (Fig. 4). For each of these techniques, we tracked the reconstruction loss (\(\mathcal {L}\)) with both the training and testing data. As illustrated in Fig. 4, we found that the RMSprop optimizer (black line) performs the best compared to the other three optimizers with respect to the testing data. Further, we find that the model’s performance can be affected by the interactions between the choice of latent dimension and other model hyperparameters.

### CVAE reveals folding intermediates of Fs-peptide

Fs-peptide is often used as a model system to study protein folding processes; here the final state of the peptide is characterized by a fully *α*-helical state. In this paper, we examined whether our CVAE can recapitulate the diverse *α*-helical intermediate states in an unsupervised manner. Figure 5a shows the latent space learned by the CVAE. Each conformation from the training data is represented as a three-dimensional coordinate in the latent space. To understand whether the latent space captured by the CVAE describes the folding process, we colored each conformer with the corresponding RMSD to the native state. The RMSD to the native state is often used as a conformational coordinate to track protein folding trajectories [38]. We note that the input to the CVAE is only the raw contact maps; however, the model is able to distinguish between low and high RMSD conformers when projected onto the latent space.

Within the latent space, we note the presence of distinct pockets with high RMSD values to the native state (red colors), which converge eventually into folded state (blue colors). The gradation of the colors along the arms of the CVAE axes indicates that the latent space (*z*_{1}−*z*_{3}) is able to describe the folding process. It indicates multiple pathways along which Fs-peptide folds into its final state. Although the CVAE-determined latent space can capture the presence of both folded/unfolded states (quantified by the total number of contacts), it is still challenging to interpret. Hence, we used t-SNE to visualize the results. We painted each conformation in the t-SNE with the RMSD values (Fig. 5b) and the fraction of native contacts (Fig. 5c). The t-SNE approach allows us to identify distinct conformational clusters observed from the simulations, labeled (i) to (viii), in the folding trajectories. In particular, we find the presence of partially folded *α*-helical bundles as well as a fully formed *α*-helix, which represents the folded state of the protein. Additionally, we also find that the different folded states are separated and connected via multiple intermediate states, all of which have relatively lower number of total contacts. This indicates that for the transitions between the folded microstates, the peptide must undergo several unfolding events.

Interestingly, our approach also reveals the presence of potentially misfolded states in these trajectories. In this work, we consider a misfolded state to be a set of conformations that share higher fraction of native contacts, but have a high RMSD from the native state ensemble of the protein. For example, state (viii) in Fig. 5c shows the presence of conformations that have higher fraction of native contacts (close to 0.95), however, its secondary structure content is significantly different from the native state, highlighted as (vii) in Fig. 5c. The intermediate states identified here have differences in their secondary structural content, i.e., the number of *α*-helical turns as depicted in states labeled (i), (ii), (iv), (vi) and (viii) along with differences in the extent to which the N- and C-terminal ends of the protein are folded (for e.g., state labeled (ii) folds from the N-terminal end versus state labeled (iii) folds from the C-terminal end).

We can also visualize the tSNE dimensions as the logarithm of the histograms as a simple estimate of the free energy surface as depicted in Fig. 5d, where by conformational states can be visualized. This representation is only for visual purposes and as such can be used for qualitative insights into the organization of the folding energy landscape of Fs-peptide. The native state of the protein, labeled (vii), consists of the fully folded peptide, while many of the partially folded states and their intermediates are distributed around the periphery of this landscape. It is interesting to note that the contours represent conformational states that correspond to folding coordinates and each of the states are marked (using solid lines) as to where they belong on this landscape.

### CVAE reveals conformational states in the VHP folding pathway

For the VHP simulations, we were able to identify a similar distribution of folded/unfolded conformations along its folding pathway (Fig. 6). Even though the reconstruction error plots from Fig. 3b indicate that the ideal number of latent dimensions is 9, we examined whether a low dimensional encoding with just three dimensions is able to capture folding events within this trajectory. Similar to the analysis of the Fs-peptide folding trajectories, the latent embedding of the CVAE reveals the presence of folded and unfolded conformations that are separated by a large number of intermediate states (Fig. 6a). Since these simulations were carried out at a higher temperature (360 K), these simulations indicate larger fluctuations in the secondary structures of VHP. Further, within the course of the simulations, a total of 34 folding events are summarized, which indicate a large number of conformational states actually correspond to folded conformations.

To enable interpretation of the VHP folding landscape, we projected the CVAE latent dimensions using t-SNE and observed that the folded states of VHP are separated into three distinct ‘wells’ that correspond to the folding events along this trajectory. The evidence for the folding events emerges from painting the t-SNE landscape with the fraction of native contacts (Fig. 6c). A large portion of the trajectory is either unfolded (e.g., states labeled (ii), (v) indicated as conformational ensembles along the trajectory) or partially folded, i.e., showing the presence of all the three helices, but with different number of helical turns (e.g., states labeled (iii) and (vi) in Fig. 6). Finally, the folded states labelled (i) and (iv) capture distinct orientations of *α*-helices as observed from the figure. It is interesting to note that the transition from one folded state to the other involves partial unfolding (similar to Fs-peptide). Further, we also note that the partially folded state (iii) consists of many native contacts; however, this state does not have all the three helices and may represent an unfolded intermediate state through which the transition to either of the two folded states may occur. The simple histogram representation of the t-SNE coordinates (Fig. 6d) provides an easy way to interpret the different conformational states with respect to the folded states in the trajectory (states (i) and (iv)).

### CVAE analysis of BBA folding simulations can be transferred to learn folding patterns across trajectories

We next examined whether the CVAE learned features could be used to predict conformational states from a completely different trajectory. To facilitate this analysis, we used the BBA simulations (see “Methods” section) as a prototype example. Our experimental set up included training the CVAE on the first long time-scale trajectory (223 *μ*s) and predicting if it captures the folding events from the second trajectory (102 *μ*s). As depicted in Fig. 7a, the three latent CVAE dimensions capture the presence of multiple folded conformational states (labeled (i), (iii) and (iv) in Fig. 7b using t-SNE). These states are separated by an intermediate state labeled (ii) and an unfolded state labeled (vi). Finally, it is interesting to note that latent space also characterizes a misfolded state, labeled (v), which shows the presence of an extended *β*-strand.

Using the same model that was trained on the first trajectory, we can project the conformations from the second, shorter simulation onto the latent space learned to test if the folded/unfolded states are separated. As shown in Figs. 7c and 7d, the latent space from the second trajectory clearly shows a separation between the folded states labeled (ii) in Fig. 7d, partially folded states labeled (i) and (iii) in Fig. 7d, and unfolded states labeled (iv) and (v) in Fig. 7d. We also observed that the latent space reconstruction difference is on par with the original model, implying that the features learned by the CVAE can indeed be transferred.