Elastic network models
A unique collection of multiple ENM force fields is now provided within Bio3D. These include the popular anisotropic network model (ANM) [19], the associated parameter-free ANM [20], and a more sophisticated C-alpha force field derived from fitting to the Amber94 all-atom potential [21]. Also included is the REACH force field employing force constants derived from MD simulations [22], and a recent parameterization providing sequence-specific force constants obtained from an ensemble of 1500 NMR structures [23]. A convenient interface for the application of user-defined force fields is also provided enabling customized normal mode calculations, perturbation analysis, and other more advanced options as detailed online and in Additional file 1.
All implemented ENMs considered here employ a harmonic potential, where the potential energy between residues i and j is given by:
where r is the current protein conformation, r 0 represents the equilibrium conformation, and ‖r
ij
‖ the distance between residues i and j [24],[25]. By default, the Bio3D package employs the C-alpha force field [21] derived from fitting to the Amber94 all-atom potential with pair force constants given by
with units of k(r) given in kJ mol− 1 Å− 2. The selection of different force fields is described in detail both online and in Additional file 1.
Ensemble NMA
Integrated multiple sequence and structural alignment methods are utilized to facilitate the analysis of structures of unequal composition and length. From these alignments, equivalent atom positions across structure ensembles are identified and normal mode vectors determined by calculating the effective force-constant Hessian matrix as
where K
AA
represents the sub-matrix of K corresponding to the aligned C-alpha atoms, K
QQ
for the gapped regions, and K
AQ
and K
QA
are the sub-matrices relating the aligned and gapped sites [21],[26]. The normal modes of the individual structure in the ensemble can then be obtained by solving the eigenvalue problem
where V is the matrix of eigenvectors and λ the associated eigenvalues.
Ensemble PCA
Principal component analysis can be performed on any structure dataset of equal or unequal sequence composition and length to capture and characterize inter-conformer relationships. The application of PCA to both distributions of experimental structures and MD trajectories, along with its ability to provide considerable insight into the nature of conformational differences in a range of protein families has been previously discussed [27]-[30]. Briefly, PCA is based on the diagonalization of the covariance matrix, C, with elements C
ij
calculated from the aligned and superimposed Cartesian coordinates, r, of equivalent Cα atoms:
where i and j enumerate all 3 N Cartesian coordinates (N is the number of atoms), and 〈r〉 denotes the ensemble average. Projection of the distribution onto the subspace defined by the PCs with the largest eigenvalues provides a low-dimensional representation of the structures facilitating inter-conformer analysis.
Similarity measures
Multiple similarity measures have been implemented to provide an enhanced framework for the assessment and comparison of ensemble NMA and PCA. These measures also facilitate clustering of proteins based on their predicted modes of motion:
Root mean square inner product (RMSIP) measures the cumulative overlap between all pairs of the l largest eigenvectors [31], and is defined as:
where and represent the ith and jth eigenvectors obtained from protein A and B, respectively. l is the number of modes to consider which is commonly chosen to be 10. The RMSIP measure varies between 0 (orthogonal) and 1 (identical directionality).
Covariance overlap provides a measure of the correspondence between the eigenvectors (v
i
) similar to the RMSIP measure, but also includes weighing by their associated eigenvalues (λ
i
) [32]. It ranges from 0 (orthogonal) to 1 (identical), and is defined as:
Bhattacharyya coefficient provides a means to compare two covariance matrices derived from NMA or an ensemble of conformers (e.g. simulation or X-ray conformers). For ENM normal modes the covariance matrix (C) can be calculated as the pseudo inverse of the mode eigenvectors:
where v
i
represents the ith eigenvector, λ
i
the corresponding eigenvalue, and N the number C-alpha atoms in the protein structure (3 N-6 non-trivial modes). As formulated by Fuglebakk et al. [26],[33], the Bhattacharyya coefficient can then be written as
where Q is the matrix of the principal components of (C A + C B)/2, Λ is diagonal matrix containing the corresponding eigenvalues, and q the number of modes needed to capture 90% of the variance of Q. The Bhattacharyya coefficient varies between 0 and 1, and equals to 1 if the covariance matrices (C A and C B) are identical.
Squared Inner Product (SIP) measures the linear correlation between two atomic fluctuation profiles [33],[34]. It varies between 0 and 1 and is defined as
where w
A
and w
B
w
B
are vectors of length N containing the fluctuation value (e.g. RMSF) for each atom in protein A and B, respectively.
PCA of cross-correlation and covariance matrices
New functionality facilitates PCA of residue-residue cross-correlations and covariance matrices derived from ensemble NMA. This analysis can be formulated as
where Υ is a matrix containing the elements of the M correlation/covariance matrices (with one row per structure), B the eigenvectors and Γ the associated eigenvalues. Projection into the sub-space defined by the largest eigenvectors enables clustering of the structures based on the largest variance within the cross-correlation or covariance matrices.
All similarity measures described above can be utilized for clustering the ensemble of structures based on their normal modes. Various clustering algorithms are available, such as k-means clustering, as well as hierarchical clustering using the Ward’s minimum variance method, or single, complete and average linkage. The application and comparison of the described similarity measures is presented in Additional file 2.
Force constants variance weighting
We propose to incorporate knowledge on the accessible conformational ensemble (e.g. all available X-ray structures) to lift the dependency of the force constants on the single structure they were derived from. We weigh the force constants with the variance of the pairwise residue distances derived from the ensemble of structures. The weights (W
ij
) and the modified force constants (k’
ij
(r)) can then be calculated as
where S
ij
(the elements of matrix S) represents the variance of the distance between residues i and j in the ensemble, ŝ is the maximum of such variance for any pair of atoms, and φ is an optional scaling factor. The application of force constant weighting is presented in Additional file 1.
Identification of dynamic domains
Analysis and identification of dynamic domains, i.e. parts of the molecule that move as relatively rigid entities within a conformational ensemble, is made available through a new implementation of the GeoStaS algorithm previously presented as a standalone Java program [35]. The algorithm relies on the identification of the best pairwise superimposition of atomic trajectories based on rotation and translation in quaternion space. The resulting atomic movement similarity matrix provides a means for clustering the atoms in the system based on their respective similarity. This approach has the advantage of capturing the potential correlation in rotational motions of two atoms placed on opposite sites, which may otherwise be found to be anti-correlated in a standard cross-correlation analysis. The application of GeoStaS is demonstrated in Additional files 1 and 2 for both single structure and ensemble NMA, as well as for ensembles of PDB structures and MD trajectories.
Correlation network analysis
Correlation network analysis can be employed to identify protein segments with correlated motions. In this approach, a weighted graph is constructed where each residue represents a node and the weight of the connection between nodes, i and j, represents their respective cross-correlation value, c
ij
, expressed by either the Pearson-like form [36], or the linear mutual information [37]. Here we propose an approach related to that introduced by Sethi et al. [38], but using multiple correlation matrices derived from the input ensemble instead of contact maps. Specifically, the correlation matrix (C) is calculated for each structure in the ensemble NMA. Then, edges are added for residue pairs with c
ij
≥ c 0 across all experimental structures, where c 0 is a constant. In addition, edges are added for residues where c
ij
≥ c 0 for at least one of the structures and the Cα-Cα distance, d
ij
, satisfies d
ij
≤ 10 Å for at least 75% of all conformations. Edges weights are then calculated with − log(〈c
ij
〉), where 〈 ⋅ 〉 denotes the ensemble average. Girvan and Newman betweeness clustering [39] is then performed to generate aggregate nodal clusters, or communities, that are highly intra-connected but loosely inter-connected. Visualization of the resulting network and community structures in both 2D and 3D along with additional clustering and analysis options are also provided. See Additional file 4 for a complete example of the integration of ensemble NMA with correlation network analysis.