Skip to main content

Table 2 Sample code to guide iterative CPD workflows

From: rstoolbox - a Python library for large-scale analysis of computational protein design data and structural bioinformatics

Action Code Sample
Load import rstoolbox as rs
import matplotlib.pyplot as plt
import seaborn as sns
Read # Load design population. A description dictionary can be provided to alter the
# information loaded from the silent file. In this case, we load all the
# sequence information available for all possible chains in the decoys.
df = rs.io.parse_rosetta_file(‘1kx8gen2.silent.gz’, {‘sequence’: ‘*’})
# Select the top 5% designs by score and obtain the residues
# overrepresented by more than 20%
df_top = df[df[‘score’] < df[‘score’].quantile(0.05)]
freq_top = rs.analysis.sequential_frequencies(df_top, ‘A’, ‘sequence’, ‘protein’)
freq_all = df .sequence_frequencies (‘A’) # shortcut to utils.sequential_frequencies
freq_diff = (top - freq)
muts = freq_diff[(freq_diff.T > 0.20).any()].idxmax(axis = 1)
muts = list(zip(muts.index, muts.values))
# Select the best scored sequence that does NOT contain ANY of those residues
pick = df .get_sequence_with (‘A’, muts, confidence = 0.25,
                                                                                  invert = True).sort_values(‘score’).iloc[:1]
# Setting a reference sequence in a DesignFrame allows to use this sequence as
# source for mutant generation and sequence comparison, amongst others.
seq = pick.iloc[0 ].get_sequence (‘A’)
pick .add_reference_sequence (‘A’, seq)
# Generate mutants based on the identified overrepresented variants:
# 1. Create a list with positions and residue type expected in each position
muts = [(muts[i][0], muts[i] [1] + seq[muts[i][0] - 1]) for i in range (len(muts))]
# 2 Generate a DesignFrame containing the new expected sequences
variants = pick .generate_mutant_variants (‘A’, muts)
variants .add_reference_sequence (‘A’, seq)
# 3. Generate the resfiles that will guide the mutagenesis
variants = variants .make_resfile (‘A’, ‘NATAA’, ‘mutants.resfile’)
# 4. With Rosetta installed, we can automatically run those resfiles.
variants = variants .apply_resfile (‘A’, ‘variants.silent’)
variants = variants .identify_mutants (‘A’)
Plot fig = plt.figure(figsize = (170 / 25.4, 170 / 25.4))
grid = (3, 4)
# Visualize overrepresented residues in the top 5%
ax = plt.subplot2grid(grid, (0, 0), colspan = 4, rowspan = 4)
cbar_ax = plt.subplot2grid(grid, (4, 0), colspan = 4, rowspan = 1)
sns.heatmap(freq_diff.T, ax = ax, vmin = 0, cbar_ax = cbar_ax)
rs.utils.add_top_title (ax, ‘Top scoring enrichment’)
# Compare query positions: initial sequence vs. mutant generation
ax = plt.subplot2grid(grid, (5, 0), colspan = 2, rowspan = 2)
key_res = [mutants[0] for mutants in muts]
rs.plot.logo_plot_in_axis (pick, ‘A’, ax = ax, _residueskr)
ax = plt.subplot2grid(grid, (5, 2), colspan = 2, rowspan = 2)
rs.plot.logo_plot_in_axis (variants, ‘A’, ax = ax, key_residues = kr)
# Check which mutations perform better
ax = plt.subplot2grid(grid, (7, 0), colspan = 2, rowspan = 3)
sns.scatterplot(‘mutant_count_A’, ‘score’, data = variants, ax = ax)
# Show distribution of best performing decoys
ax = plt.subplot2grid(grid, (7, 2), fig = fig, colspan = 2, rowspan = 3)
rs.plot.logo_plot_in_axis (variants.sort_values(‘score’).head(3), ‘A’, ax = ax, key_residues = kr)
plt.tight_layout()
plt.savefig(‘BMC_Fig3.png’, dpi = 300)
  1. This example shows how to find overrepresented residue types for specific positions in the top 5% scored decoys of a design population, and use those residue types to bias the next design generation, thus creating a new, enriched second generation population. Code comments are presented in italics while functions from rstoolbox are highlighted in bold. Styling commands are skipped to facilitate reading, but can be found in the repository’s notebook.