Table 4 Sample code for the comparison between 4 different decoy populations

From: rstoolbox - a Python library for large-scale analysis of computational protein design data and structural bioinformatics

Action

Code Sample

Load

import pandas as pd
import rstoolbox as rs
import matplotlib.pyplot as plt

Read

df = []
# With Rosetta installed, scoring can be run for a single structure
baseline = rs.io.get_sequence_and_structure('4yod.pdb')
experiments = ['no_target', 'static', 'pack', 'packmin']
scores = ['score', 'LocalRMSDH', 'post_ddg', 'bb_clash']
scorename = ['score', 'RMSD', 'ddG', 'bb_clash']
for experiment in experiments:
    # Load Rosetta silent file from decoy generation
    ds = rs.io.parse_rosetta_file(experiment + '.design')
    # Load decoy evaluation from a pre-processed CSV file.
    # Casting pd.DataFrame into DesignFrame is as easy as shown here.
    ev = rs.components.DesignFrame(pd.read_csv(experiment + '.evals'))
    # Different outputs for the same decoys can be combined through
    # their 'description' field (decoy identifier)
    df.append(ds.merge(ev, on='description'))
# Tables can be joined together into a single working object
df = pd.concat(df)
# As we are comparing over BINDI's sequence, that is our reference.
df.add_reference_sequence('B', baseline.iloc[0].get_sequence('B')[:-1])

Plot

fig = plt.figure(figsize=(170 / 25.4, 170 / 25.4))
grid = (12, 4)
# Show the distribution for key score terms
axs = rs.plot.multiple_distributions(df, fig, grid, values=scores, rowspan=3,
                                     labels=scorename, x='binder_state',
                                     order=experiments, showfliers=False)
# Sequence score for selected decoys with standard-matrix weights
ax = plt.subplot2grid(grid, (3, 0), fig=fig, colspan=4, rowspan=4)
qr = df[df['binder_state'] == 'no_target'].sort_values('score').iloc[0]
rs.plot.per_residue_matrix_score_plot(qr, 'B', ax, 'BLOSUM62', add_alignment=False, color=0)

qr = df[df['binder_state'] == 'pack'].sort_values('score').iloc[0]
rs.plot.per_residue_matrix_score_plot(qr, 'B', ax, 'BLOSUM62', add_alignment=False, color=2,
                                      selections=[('43-64', 'red')])

# Small functions help edit the plot display
rs.utils.add_top_title(ax, 'no_target (blue) - pack (green)')
# Evaluate the variability of residue types in the binding region
ax = plt.subplot2grid(grid, (7, 0), fig=fig, colspan=2, rowspan=4)
qr = df[df['binder_state'] == 'no_target']
rs.plot.sequence_frequency_plot(qr, 'B', ax, key_residues='43-64', cbar=False,
                                clean_unused=0.1, xrotation=90)
rs.utils.add_top_title(ax, 'no_target')
ax = plt.subplot2grid(grid, (7, 2), fig=fig, colspan=2, rowspan=4)
ax_cbar = plt.subplot2grid(grid, (11, 0), fig=fig, colspan=4)
rs.plot.sequence_frequency_plot(df[df['binder_state'] == 'pack'], 'B', ax, key_residues='43-64',
                                cbar_ax=ax_cbar, clean_unused=0.1, xrotation=90)
rs.utils.add_top_title(ax, 'pack')
plt.tight_layout()
plt.savefig('BMC_Fig5.png', dpi=300)

  1. The code shows how to join data from multiple Rosetta experiments to assess the key differences between four design populations in terms of scoring metrics and sequence recovery. Code comments are presented in italics, while functions from rstoolbox are highlighted in bold. Styling commands are omitted for readability but can be found in the repository's notebook.
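The join pattern used in the Read step can be tried without Rosetta output files. The sketch below is a minimal illustration with plain pandas only: the helper `toy_tables`, the decoy identifiers, and all numeric values are hypothetical stand-ins for the parsed silent file and the `.evals` CSV. The `binder_state` column is assigned explicitly here, which is an assumption; in the table's code that column is expected to be present in the loaded data already.

```python
import pandas as pd

experiments = ["no_target", "static", "pack", "packmin"]


def toy_tables(experiment):
    """Return hypothetical (scores, evaluations) tables for one experiment."""
    ids = [f"{experiment}_{i:04d}" for i in range(1, 4)]
    scores = pd.DataFrame({"description": ids,
                           "score": [-120.0, -135.5, -128.2]})
    evals = pd.DataFrame({"description": ids,
                          "bb_clash": [3.1, 0.4, 1.7]})
    return scores, evals


frames = []
for experiment in experiments:
    scores, evals = toy_tables(experiment)
    # Combine both outputs for the same decoys through the shared identifier
    merged = scores.merge(evals, on="description")
    # Assumption: label each row with the experiment it came from
    merged["binder_state"] = experiment
    frames.append(merged)

# Concatenate once, after the loop, into a single working table
df = pd.concat(frames, ignore_index=True)

# Pick the best-scoring (lowest score) decoy of one population
best = df[df["binder_state"] == "pack"].sort_values("score").iloc[0]
print(best["description"], best["score"])
```

Collecting the per-experiment tables in a list and calling `pd.concat` a single time after the loop keeps the accumulator a list throughout the iteration, which is why the concatenation sits outside the loop body.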