Integrating protein structural dynamics and evolutionary analysis with Bio3D

Background Popular bioinformatics approaches for studying protein functional dynamics include comparisons of crystallographic structures, molecular dynamics simulations and normal mode analysis. However, determining how observed displacements and predicted motions from these traditionally separate analyses relate to each other, as well as to the evolution of sequence, structure and function within large protein families, remains a considerable challenge. This is in part due to the general lack of tools that integrate information of molecular structure, dynamics and evolution. Results Here, we describe the integration of new methodologies for evolutionary sequence, structure and simulation analysis into the Bio3D package. This major update includes unique high-throughput normal mode analysis for examining and contrasting the dynamics of related proteins with non-identical sequences and structures, as well as new methods for quantifying dynamical couplings and their residue-wise dissection from correlation network analysis. These new methodologies are integrated with major biomolecular databases as well as established methods for evolutionary sequence and comparative structural analysis. New functionality for directly comparing results derived from normal modes, molecular dynamics and principal component analysis of heterogeneous experimental structure distributions is also included. We demonstrate these integrated capabilities with example applications to dihydrofolate reductase and heterotrimeric G-protein families along with a discussion of the mechanistic insight provided in each case. Conclusions The integration of structural dynamics and evolutionary analysis in Bio3D enables researchers to go beyond a prediction of single protein dynamics to investigate dynamical features across large protein families. The Bio3D package is distributed with full source code and extensive documentation as a platform independent R package under a GPL2 license from http://thegrantlab.org/bio3d/. Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0399-6) contains supplementary material, which is available to authorized users.

1 The latest version of the package, full documentation and further vignettes (including detailed installation instructions) can be obtained from the main Bio3D website: http://thegrantlab.org/bio3d/ 2 This vignette contains executable examples, see help(vignette) for further details.

Search and retrieve DHFR structures
Below we perform a blast search of the PDB database to identify related structures to our query E.coli DHFR sequence. In this particular example we use function get.seq() to fetch the query sequence for chain A of the PDB ID 1RX2 and use this as input to blast.pdb(). Note that get.seq() would also allow the corresponding UniProt identifier.

library(bio3d)
aa <-get.seq("1rx2_A") blast <-blast.pdb ( The Blast search and subsequent filtering identified a total of 101 related PDB structures to our query sequence. The PDB identifiers of this collection are accessible through the pdb.id attribute to the hits object (hits$pdb.id). Note that adjusting the cutoff argument (to plot.blast()) will result in a decrease or increase of hits.
We can now use function get.pdb() and pdbslit() to fetch and parse the identified structures. Finally, we use pdbaln() to align the PDB structures.

Annotate collected PDB structures
Function pdb.annotate() provides a convenient way of annotating the PDB files we have collected. Below we use the function to annotate each structure to its source species. This will come in handy when annotating plots later on: anno <-pdb.annotate(ids) print(unique(anno$source))

Principal component analysis
A principal component analysis (PCA) can be performed on the structural ensemble (stored in the pdbs object) with function pca.xyz(). To obtain meaningful results we first superimpose all structures on the invariant core (function core.find()).

Fluctuation analysis
Comparing

Compare with MD simulation
The above analysis can also nicely be integrated with molecular dynamics (MD) simulations. Below we read in a 5 ns long MD trajectory (of PDB ID 1RX2). We visualize the conformational sampling by projecting the MD conformers onto the principal components (PCs) of the X-ray ensemble, and finally compare the PCs of the MD simulation to the normal modes:   # compare MD-PCA and X-rayPCA r <-rmsip(pc.md, pc.xray)

Domain analysis with GeoStaS
Identification of regions in the protein that move as rigid bodies is facilitated with the implementation of the GeoStaS algorithm (Romanowska, Nowinski, and Trylska 2012). Below we demonstrate the use of function geostas() on data obtained from ensemble NMA, an ensemble of PDB structures, and a 5 ns long MD simulation. See help(geostas) for more details and further examples.
GeoStaS on a PDB ensemble: Below we input the pdbs object to function geostas() to identify dynamic domains. Here, we attempt to divide the structure into 2 sub-domains using argument k=2. Function geostas() will return a grps attribute which corresponds to the domain assignment for each C-alpha atom in the structure. Note that we use argument fit=FALSE to avoid re-fitting the coordinates since. Recall that the coordinates of the pdbs object has already been superimposed to the identified invariant core (see above

GeoStaS on a MD trajectory:
The domain analysis can also be performed on the trajectory data obtained from the MD simulation (see above). Recall that the MD trajectory has already been superimposed to the invariant core. We therefore use argument fit=FALSE below. We then perform a new PCA of the MD trajectory, and visualize the domain assingments with function mktrj(): gs.md <-geostas(trj, k=2, fit=FALSE) pc.md <-pca(trj, fit=FALSE) mktrj(pc.md, pc=1, chain=gs.md$grps) Figure 10: Visualization of domain assignment using function geostas() on the first five normal modes of the entire ensemble of 82 DHFR structures.

Measures for modes comparison
Bio3D now includes multiple measures for the assessment of similarity between two normal mode objects. This enables clustering of related proteins based on the predicted modes of motion. Below we demonstrate the use of root mean squared inner product (RMSIP), squared inner product (SIP), covariance overlap, bhattacharyya coefficient, and PCA of the corresponding covariance matrices.

Document Details
This document is shipped with the Bio3D package in both R and PDF formats. All code can be extracted and automatically executed to generate Figures and/or the PDF with the following commands: library(rmarkdown) render("Bio3D_nma-dhfr-partI.Rmd", "all")  Figure 15: Dendrogram shows the results of hierarchical clustering of structures based on the PCA of covariance matrices (calculated from NMA). Colors of the labels depict associated conformatial state: green (occluded), black (open), and red (closed). The inset shows the conformerplot (see Figure 2), with colors according to clustering based on PCA of covariance matrices.