Propedia: a database for protein–peptide identification based on a hybrid clustering algorithm

Background Protein–peptide interactions play a fundamental role in a wide variety of biological processes, such as cell signaling, regulatory networks, immune responses, and enzyme inhibition. Peptides are characterized by low toxicity and small interface areas; therefore, they are good targets for therapeutic strategies, rational drug planning and protein inhibition. Approximately 10% of the ethical pharmaceutical market is protein/peptide-based. Furthermore, it is estimated that 40% of protein interactions are mediated by peptides. Despite the fast increase in the volume of biological data, particularly on sequences and structures, there remains a lack of broad and comprehensive protein–peptide databases and tools that allow the retrieval, characterization and understanding of protein–peptide recognition and consequently support peptide design. Results We introduce Propedia, a comprehensive and up-to-date database with a web interface that permits clustering, searching and visualizing of protein–peptide complexes according to varied criteria. Propedia comprises over 19,000 high-resolution structures from the Protein Data Bank including structural and sequence information from protein–peptide complexes. The main advantage of Propedia over other peptide databases is that it allows a more comprehensive analysis of similarity and redundancy. It was constructed based on a hybrid clustering algorithm that compares and groups peptides by sequences, interface structures and binding sites. Propedia is available through a graphical, user-friendly and functional interface where users can retrieve, and analyze complexes and download each search data set. We performed case studies and verified that the utility of Propedia scores to rank promissing interacting peptides. In a study involving predicting peptides to inhibit SARS-CoV-2 main protease, we showed that Propedia scores related to similarity between different peptide complexes with SARS-CoV-2 main protease are in agreement with molecular dynamics free energy calculation. Conclusions Propedia is a database and tool to support structure-based rational design of peptides for special purposes. Protein–peptide interactions can be useful to predict, classifying and scoring complexes or for designing new molecules as well. Propedia is up-to-date as a ready-to-use webserver with a friendly and resourceful interface and is available at: https://bioinfo.dcc.ufmg.br/propedia


Peptide Selection for Metadynamics Validation and System Setup
Four representative peptide-protein complexes were selected from the Propedia output for the crystallographic structure of the Sars-Cov-2 main protease (the M P ro with PDB:ID 6lu7) to posterior binding free energy (∆G bind ) calculation by metadynamics simulation [1,2,3], aiming the validation of the Propedia scores. The peptides were selected looking for a maximal coverage of the Propedia ranking based both on the alignment score as at the RMSD considering the active site (Table S.1).
For the four selected complexes, the Rosetta pose presenting the best score and preserving the previously attributed protonation states was carried to a procedure of topology assembly, solvation on a water box with a 12 Å padding, and Na + /Cl − addition until the system neutralizing and ionic strength of 0.150 M. For these procedures, the respective psfgen, solvate and ionize tools from the VMD/NAMD packages were used [4,5]. The CHARMM36 force field [6,7] was used both for the protein, peptide, water and ions, as well the TIP3P model for the water molecules [8].

Simulation Procedures
All the simulations were carried at the NAMD 2.13 package [5], at the NPT ensemble, using Langevin thermostat and barostat devices, respectively set to 300 K and 1 atm. Periodic boundary conditions were used, as well particle mesh Ewald (PME) for the calculations of the long range electrostatic forces, with a 12 Å cutoff for the nonbond interactions and a 2 fs time-step. The hydrogen atoms dynamics was constrained and estimated by the SETTLE algorithm according implemented in NAMD 2.13 [5].
Before the metadynamics procedures themselves, a meticulous minimization/relaxation/equilibratio protocol was carried for each system. First, each system was minimized for 10,000 steps by conjugate gradient algorithm according implemented in NAMD 2.13 [5].
In sequence, a 10 steps relaxation/equilibration molecular dynamics (MD) protocol, at the previously listed simulation conditions and with gradual adaptation of harmonic restraints, was carried as below: • 500 ps MD with harmonic restrains for all the atoms of the receptor and the ligand.
• 500 ps MD with harmonic restrains just for the backbone atoms of the receptor and the ligand.
• 500 ps MD with harmonic restraints just for the backbone atoms of the receptor.
• 8 ns MD without harmonic restraints and with previous reboot of the velocities according a 300 K and 1 atm NPT ensemble.
• 500 ps MD reintroducing the harmonic restraints at the backbone atoms of the receptor and the ligand.
• 500 ps MD reintroducing the harmonic restraints at all atoms of the receptor and the ligand.
• 300 ps MD with removal of the harmonic restraints from the ligand side chains.
• 300 ps MD with removal of the harmonic restraints from the ligand all atoms.
• 1 ns MD at the aforementioned conditions and rebooting the velocities to a 300 K and 1 atm compatible NPT ensemble.
The last 5 steps were carried in order to prepare the system for the restraints conditions used along the metadynamics procedure (i.e., harmonic restraints for the entire receptor and complete freedom just for the ligand (see below)).
After relaxation, the last frame for each system was taken to three respective and independent 10 ns metadynamics procedures (i.e., three metadynamics simulations per system) of ligand unbind from the active site according a similar protocol to the described in [3]. Basically, all the atoms of the M P ro were maintained harmonically restrained, while complete dynamics freedom was given to the peptide, water molecules and ions. Two collective variables (CVs) were used to describe the unbinding process of the peptide from the catalytic pocket. The first, CV dist was defined as the distance in Å between the respective centers of mass of the catalytic C145 in were saved every 1 ps. Results were analyzed with VMD software [4], in-house R and Python scripts, as well the Wordom package [9].

Free Energy Maps, Projections of the Metadynamics Energies Along the CV dist dimension and Estimation of ∆G bind
To select the number of frames to be considered along the PMF reconstruction, we have used both the molecular mechanics nonbond interaction energies between the peptide and the protein, as the distance associated to CV dist . (i.e., the distance between the respective C145 and the closer residue mass centers) as a metric to observe: 1) the access and the filling of the energy minimum A (i.e., the energy minimum for the peptide inside the protein); 2) the access and the filling of the energy minimum B (i.e., the energy minimum for the peptide outside the protein, at the aqueous environment); 3) the re-crossing event (i.e., the simulation phase in which, once the system has reached and completely filled both minima, the peptide gain higher dynamics freedom and turn to visit both minima repeatedly). Following the suggested in literature, we have selected the PMF maps saved until the simulation step immediately before the re-crossing event to reconstruct the free energy landscape (FEL) along the unbinding event [3,10]. This is made in order to avoid the over-filling of the energy minima by the metadynamics Gaussian potentials and a loss of accuracy along such FEL reconstruction.
The CV most directly related to the unbinding process is, naturally, the one that describes the distance variation between the peptide and the active site (CV dist ). In this way, the projections of the metadynamics free energy onto the CV dist . was calculated similarly to [3] as following equation: −βG CV ang (CV dist ) = ln e −βG(CV dist , CV ang ) dCV ang e −βG(CV dist , CV ang ) dCV ang dCV dist (S.1) where β = 1/kbT, being k b the Boltzmann constant (1,9858 x 10 −3 ·kcal·mol −1 ·K −1 ); T = 300 K and G(CV dist , CV ang ) accounts for the free energy value at the position (CV dist , CV ang ) position on the PMF map.
For the estimation of ∆G bind from each metadynamics replica, the minima value of G CV ang (CV dist ) at a CV dist compatible with the complete independence of the peptide environment from the protein influence (i.e., CV dist ≥ 25 Å) was diminished from the minima value of this measure inside a distance compatible with the peptide bound (specifically or not) to the protein active site (i.e., CV dist ≤ 20 Å) according equation: where G CV ang (CV dist ) is the minimum value inside, while G CV ang (CV dist ) is the minimum outside the protein. For the cases in which two or more minima with similar favorability were found inside the protein, both minima were equally weighted at a global value of G CV ang (CV dist ) (i.e.,G CV ang (CV dist ) M in inside ) according equation:  Table S.1 Correlation between the metadynamics estimated binding free energy (MetaD ∆G bind and it is standard deviations(σ)) and the Propedia recovered alignment score (Align. Score) and site RMSD. At the last two columns the respective negative and positive correlation coefficients of the ∆G bind with each Propedia parameter are depicted.