Learning probabilistic models of hydrogen bond stability from molecular dynamics simulation trajectories
 Igor Chikalov^{1}Email author,
 Peggy Yao^{2},
 Mikhail Moshkov^{1} and
 JeanClaude Latombe^{3}
https://doi.org/10.1186/1471210512S1S34
© Chikalov et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Abstract
Background
Hydrogen bonds (Hbonds) play a key role in both the formation and stabilization of protein structures. They form and break while a protein deforms, for instance during the transition from a nonfunctional to a functional state. The intrinsic strength of an individual Hbond has been studied from an energetic viewpoint, but energy alone may not be a very good predictor.
Methods
This paper describes inductive learning methods to train proteinindependent probabilistic models of Hbond stability from molecular dynamics (MD) simulation trajectories of various proteins. The training data contains 32 input attributes (predictors) that describe an Hbond and its local environment in a conformation c and the output attribute is the probability that the Hbond will be present in an arbitrary conformation of this protein achievable from c within a time duration Δ. We model dependence of the output variable on the predictors by a regression tree.
Results
Several models are built using 6 MD simulation trajectories containing over 4000 distinct Hbonds (millions of occurrences). Experimental results demonstrate that such models can predict Hbond stability quite well. They perform roughly 20% better than models based on Hbond energy alone. In addition, they can accurately identify a large fraction of the least stable Hbonds in a conformation. In most tests, about 80% of the 10% Hbonds predicted as the least stable are actually among the 10% truly least stable. The important attributes identified during the tree construction are consistent with previous findings.
Conclusions
We use inductive learning methods to build proteinindependent probabilistic models to study Hbond stability, and demonstrate that the models perform better than Hbond energy alone.
Keywords
Background
A protein is a long sequence of aminoacids, called residues. Under normal physiological conditions, various forces (electrostatic, van der Waals, ...) lead the protein to fold into a compact structure made of secondary structure elements, αhelices and βstrands, connected by bends (called loops). An Hbond corresponds to the attractive electrostatic interaction between a covalent pair D—H of atoms, in which the hydrogen atom H is bonded to a more electronegative donor atom D, and an electronegative acceptor atom A. Due to their strong directional character, short distance ranges, and large number in folded proteins, Hbonds play a key role in both the formation and stabilization of protein structures [1–3]. While Hbonds involving atoms from close residues along the mainchain sequence stabilizes secondary structure elements, Hbonds between atoms in distant residues stabilize the overall 3D arrangement of secondary structure elements and loops.
Hbonds form and break while the conformation of a protein deforms. For instance, the transition of a folded protein from a nonfunctional state into a functional (e.g., binding) state may require some Hbonds to break and others to form [4]. So, to better understand the possible deformation of a folded protein, it is desirable to create a reliable model of Hbond stability. Such a model makes it possible to identify rigid groups of atoms in a given protein conformation and determine the remaining degrees of freedom of the structure [7]. Since most Hbonds in a protein conformation are quite stable, it is crucial that the model precisely identifies the least stable bonds. The intrinsic strength of an individual Hbond has been studied before from an energetic viewpoint [5, 6]. However, potential energy alone may not be a very good predictor of Hbond stability. Other local interactions may reinforce or weaken an Hbond.
Methods
I. Problem statement
where I (q, H, t) is a Boolean function that takes value 1 if H is present in the conformation q(t) at time t along trajectory q, and 0 otherwise. The value can be interpreted as the probability that H will be present in the conformation of P at any specified time t ∈ (0, Δ), given that P is at conformation c at time 0. Our goal is to design a method for generating good approximations σ of . We also want these approximations to be proteinindependent.
II. General approach
We use machine learning methods to train a stability model σ from a given set Q of MD simulation trajectories. Each trajectory q ∈ Q is a sequence of conformations of a protein. These conformations are reached at times t_{ i } = i × δ, i = 0, 1, 2, …, called ticks, where δ is typically on the order of picoseconds. We detect the Hbonds present in each conformation q(t_{ i }) using the geometric criteria given in [8]. Note that an Hbond in a given protein is uniquely identified (across different conformations) by its donor, acceptor, and the hydrogen atom. So, we call the presence of a specific Hbond H in a conformation q(t_{ i }) an occurrence of H, denoted by h.
For each h, we compute a fixed list of predictors, some numerical, others categorical. Some are timeinvariant, like the number of residues along the mainchain between the donor and acceptor atoms. Others are timedependent. Among them, some describe the geometry of h, e.g., the distance between the hydrogen and the donor. Others describe the local environment of h, e.g., the number of other Hbonds within a certain distance from the midpoint of H.
We train σ as a function of these predictors. The predictor list defines a predictor space ∑ and every Hbond occurrence maps to a point in ∑. Given the input set Q of trajectories, we build a data table in which each row corresponds to an occurrence h of an Hbond present in a conformation q(t_{ i }) contained in Q. So, many rows may correspond to the same Hbond at different ticks. In our experiments, a typical data table contains several hundred thousand rows. Each column, except the last one, corresponds to a predictor p and the entry (h, p) of the table is the value of p for h. The entry in the last column is the measured stability y of h. More precisely, let H be the Hbond of which h is an occurrence. Let l = Δ/δ, where Δ is the duration over which we wish to predict the stability of h, and m ≤ l be the number of ticks t_{ k }, k = i + 1, i + 2,…,i + l, such that H is present in q(t_{ k }). The measured stability y of h is the ratio m/l. We chose l = 50 in most of the tests reported below, as this value both provides a ratio m/l large enough for the measured stability to be statistically meaningful, and corresponds to an interesting prediction timescale (50ps). Typically, most Hbond occurrences are quite stable: over 25% have measured stability 1, about 50% higher than 0.8, and only 15% less than 0.3.
We build σ as a binary regression tree [9]. This machine learning approach has been one of the most successful in practice. Regression trees are often simple to interpret. The method can work with both categorical and numerical predictors in a unified way, as shown in Section III. Each nonleaf node in a regression tree is a Boolean split. So, each node N of the tree determines a region of ∑ in which all the splits associated with the arcs connecting the root of the tree to N are satisfied. We say that an Hbond occurrence falls into N if it is contained in this region. The predicted stability value stored at a leaf node L is the average of the measured stability values by all the Hbond occurrences in the training data table that fall into L. We expect this average, which is taken over many pieces of trajectories, to approximate well the average defined in Equation (1).
Once a regression tree has been generated, it is used as follows. Given an Hbond H in an arbitrary conformation c of an arbitrary protein, the leaf node L of the tree into which H falls is identified by calculating the values of the necessary predictors for H in c. The predicted stability value stored at L is returned.
III. Training algorithm
We construct a model σ as a binary regression tree using the CART method [9]. The tree is generated recursively in a topdown fashion. When a new node N is created, it is inserted as a leaf of the tree if a predefined depth has been reached or if the number of h falling into N is smaller than a predefined threshold. Otherwise, N is added as an intermediate node, its split is computed, and its left and right children are created. A split s is defined by a pair (p, r), where p is the split predictor and r is the split value. If p is a numerical predictor, then r is a threshold on p, and s ≜ p <r. If p is a categorical predictor, then r is a subset of categories, and s ≜ p ∈ r. We define the score w(p, r) of split s = (p, r) at a node N as the reduction of variance in measured stability that results from s. The algorithm chooses the split—both the predictor and the split value—that has the largest score. Only a relatively small subset of predictors selected by the training algorithm is eventually used in a regression tree.
To prevent model overfitting, we limit tree depth to 5 in most of our experiments and limit the minimal number of training samples in an intermediate node to be 10. We further prune the obtained tree using the following adaptive algorithm. We initially set aside a fraction of the training data table called validation subset. Once a tree has been constructed pruning is an iterative process. At each step, one intermediate node N whose split has minimal score becomes a leaf node by removing the subtree rooted at N. This process creates a sequence of trees with decreasing numbers of nodes. We compute the mean square error of the predictions made by each tree on the validation subset. The tree with the smallest error is selected.
Results
I. Experimental setup
Characteristics of the MD simulation trajectories used to create the 6 datasets
Trajectory  Protein  # res.  Force field  Duration  # Hbonds  # occurrences 

1c9oA  Cold shock protein  66  ENCAD [12] with F3C explicit water model  10ns  263  363463 
1e85A  Cytochrome C  124  Same as above  10ns  525  1253879 
1g9oA_1  PDZ1 domain of human Na(+)/H(+) exchanger regulatory factor  91  Same as above  10ns  374  558761 
1g9oA_2  Same as above  91  Same as above  10ns  397  544491 
complex  EfbC/C3d complex formed by the C3d domain of human Complement Component C3 and one of its bacterial inhibitors  362  Amber 2003 with implicit solvent using the General Born solvation method [13]  2ns  1825  348943 
1eia  EIAV capsid protein P26  207  Amber 2003 with SPC/E water model  2ns  757  379573 
From each trajectory we derived a separate data table in which the rows represent Hbond occurrences. Last two columns in Table 1 list the number of distinct Hbonds detected in each trajectory and the total number of Hbond occurrences extracted. Note that complex was generated for a complex of two molecules. All Hbonds occurring in this complex are taken into account in the corresponding data table.
The values of the timevarying predictors are subject to thermal noise. Since a model σ will in general be used to predict Hbond stability in a protein conformation sampled using a kinematic model ignoring thermal noise (e.g., by sampling the dihedral angles ϕ, ψ, and χ) [7], we chose to average the values of these predictors over l' ticks to remove thermal noise. More precisely, the value of a predictor stored in the row of the data table corresponding to an Hbond occurrence in q(t_{ i }) is the average value of this predictor in , where . Our analysis shows that l' = 50 is near optimal.
The performance of a regression model can be measured by the root mean square error (RMSE) of the predictions on a test dataset. For a data table T = {(x_{1}, y_{1}), (x_{2}, y_{2})},…, (x_{ n }, y_{ n })}, where each x_{ i }, i = 1,…,n, denotes a vector of predictor values for an Hbond occurrence and y_{ i } is the measured stability of the Hbond, and a model σ, the RMSE is defined by: . As RMSE depends not only on the accuracy of σ, but also on the table T, some normalization is necessary to compare results on different tables. So, in our tests we compute the decrease of RMSE relative to a base model σ_{0}. The relative base error decrease (or RBED) is then defined by: In most cases, σ_{0} is simply defined by , i.e., the average measured stability of all Hbond occurrences in the dataset. In other cases, σ_{0} is a model based on the Hbond energy.
II Generality of models trained on multiple trajectories
Mean RBED values of models trained on multiple trajectories
Fraction of data  Max tree depth  1c9oA  1e85A  1g9oA_1  1g9oA_2  complex  1eia  Average 

0.1  5  46.92  59.37  50.93  45.29  37.90  42.60  47.17 
0.5  5  47.07  59.59  50.69  45.45  38.08  43.15  47.34 
0.1  15  47.24  59.03  51.42  45.65  38.07  43.35  47.46 
0.5  15  46.87  59.04  51.38  45.89  38.38  43.46  47.50 
RBED values show that regression tree model significantly reduces base error and keeps predictive power when applied to a protein not present in the training data. Moreover, the variance of RBED values is very small, meaning that the training process yields models that are stable in performance. Furthermore, the RBED values are lower for models tested on complex. Recall that the trajectory complex was generated for a complex made of a protein and a ligand, while every other trajectory was generated for a single protein. So, it is likely that complex contains Hbonds in situations that did not occur in any of the other trajectories. Both deeper trees and larger data fractions tend to improve model accuracy, but the very small gain is not worth the additional model or computation complexity.
III. Comparison with FIRSTenergy model
Mean RBED values of models using single predictor FIRST_energy
1c9oA  1e85A  1g9oA_1  1g9oA_2  complex  1eia 

26.36  27.95  22.63  19.63  19.42  5.65 
IV. Identification of least stable Hbonds
Most Hbond occurrences tend to be stable. So, accurately identifying the weakest ones is important if one wishes to predict the possible deformation of a protein [7]. To evaluate how well our models identify the least stable Hbonds occurrences, we first identify the subset S of the 10% Hbond occurrences with the smallest measured stability in each test table T. Using a regression tree σ obtained in Section II, we sort the Hbond occurrences in T in ascending order of predicted stability and we compute the fraction w ∈ [0,1] of S that is contained in the first 100×u% occurrences in this sorted list, for successive values of u ∈ [0,1]. We call the function w(u) the identification curve of the least stable Hbonds for σ.
Discussion
In all our regression trees the root split was done with predictor Dist_H_A (the distance between the hydrogen and acceptor atoms), which therefore appear as the single most discriminative attribute to predict Hbond stability. This observation is consistent with previous findings. Levitt [6] found that most stable Hbonds have Dist_H_A less than 2.07Å. Jeffrey and Saenger [14] also suggested that Dist_H_A is a key attribute affecting Hbond stability, with a value less than 2.2Å for moderate to strong Hbonds. Consistent with these previous findings, the split values of the deepest Dist_H_A nodes in all our regression trees are around 2.1Å. This distance was observed in [6] to sometimes fluctuate by up to 3Å in stable Hbonds, due to highfrequency atomic vibration. This observation supports our decision to average predictor values over windows of l’ ticks.
Predictor FIRST_energy is often used in splits close to the root. This is not surprising since it is a function of several other pertinent predictors: Dist_H_A, Angle_D_H_A (the angle between the donor, the hydrogen atom, and the acceptor), Angle_H_A_AA (the angle between the hydrogen atom, the acceptor, and the atom covalentlybonded to the acceptor), and the hybridization state of the bond. Some other distancebased predictors (Dist_D_AA, Dist_D_A, Dist_H_D), anglebased predictors and Ch_type (describing whether the donor and acceptor are from mainchain or sidechain) predictor appear often in regression trees, but closer to the leaf nodes. They nevertheless play a significant role in predicting Hbond stability. For example, as shown in Figure 1, if Angle_H_A_AA is at least 105Â°, the stability is very high (about 0.96); otherwise, it drops to 0.71. The preference for larger angle matches well with the wellknown linearity of Hbonds [14].
Overall, we observe that predictors that describe the local environment of an Hbond play a relatively small role in predicting its stability. In particular, we had expected that descriptors such Num_hb_spaceNbr and Num_hb_spaceRgdNbr, which count the number of other Hbonds located in the neighborhood of the analyzed Hbond, would have had more importance. However, this may reflect the fact that the MD simulation trajectories used in our tests are too short to contain enough information to infer the role of such predictors. Indeed, while transitions between metastable states are rare in those trajectories, predictors describing local environments may have greater influence on the stability of Hbonds that must break for such transitions to happen. So, longer trajectories may eventually be needed to better model Hbond stability.
Conclusions
We have described machine learning methods to train proteinindependent regression trees modeling Hbond stability in proteins. Test results demonstrate that trained models can predict Hbond stability quite well. In particular, their performance is significantly better (roughly 20% better) than that of a model based on Hbond energy alone. They can accurately identify a large fraction of the least stable Hbonds in a given conformation. However, our results also suggest that better results could be obtained with a richer set of MD simulation trajectories. In particular, the trajectories used in our experiments might be too short to characterize the stability of Hbonds that break and form during a transition between metastable states.

It would be better to averaging predictor values before subsampling MD simulation trajectories. This would reduce the risk of filtering out changes in predictor values that are important for Hbond stability. Unfortunately, in our trajectories we only had access to the data after subsampling.

More sophisticated learning techniques could be used. For example, instead of generating a single tree, we could generate an ensemble of trees, such as Gradient Boosting Trees [16].

Finally, the notion of stability itself could be refined, for example by distinguishing between the case where an Hbond frequently switches on and off during a prediction window and the case where it rarely switches.
Declarations
Acknowledgements
This work was supported in part by a grant from the KAUSTStanford Academic Excellence Alliance program. The authors thank L. Kavraki (Rice Univ.), V. Pande (Stanford), M. Levitt (Stanford), and J. Wang Tsai (Univ. of the Pacific) for providing us MD simulation trajectories and for useful comments during our work.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/12?issue=S1.
Authors’ Affiliations
References
 Baker EN: Hydrogen bonding in biological macromolecules. International Tables for Crystallography 2006, F: 546–552. Chapter 22.2 Chapter 22.2 full_textView ArticleGoogle Scholar
 Fersht AR, Serrano L: Principles in protein stability derived from protein engineering experiments. Curr. Opin. Struct. Biol 1993, 3: 75–83. 10.1016/0959440X(93)90205YView ArticleGoogle Scholar
 Schell D, Tsai J, Scholtz JM, Pace CN: Hydrogen bonding increases packing density in the protein interior. Proteins 2006, 63: 278–282. 10.1002/prot.20826View ArticlePubMedGoogle Scholar
 Bikadi Z, Demko L, Hazai E: Functional and structural characterization of a protein based on analysis of its hydrogen bonding network by hydrogen bonding plot. Arch Biochem Biophys 2007, 461: 225–234. 10.1016/j.abb.2007.02.020View ArticlePubMedGoogle Scholar
 Dahiyat BI, Gordon DB, Mayo SL: Automated design of the surface positions of protein helices. Protein Science 1997, 6: 1333–1337. 10.1002/pro.5560060622PubMed CentralView ArticlePubMedGoogle Scholar
 Levitt M: Molecular dynamics of hydrogen bonds in bovine pancreatic trypsin unhibitor protein. Nature 1981, 294: 379–380. 10.1038/294379a0View ArticlePubMedGoogle Scholar
 Thorpe MF, Lei M, Rader AJ, Jacobs DJ, Kuhn LA: Protein flexibility and dynamics using constraint theory. J. Mol. Graph. Model 2001, 19: 60–69. 10.1016/S10933263(00)001224View ArticlePubMedGoogle Scholar
 McDonald IK, Thornton JM: Satisfying hydrogen bonding potential in proteins. J. Mol. Biol 1994, 238: 777–793. 10.1006/jmbi.1994.1334View ArticlePubMedGoogle Scholar
 Breiman L, Friedman JH, Ilshen RA, Stone CJ: Classification and regression trees. CRC Press; 1984.Google Scholar
 Joo H, Qu X, Swanson R, McCallum CM, Tsai J: Modeling the dependency of residue packing upon backbone conformation using molecular dynamics simulation. Comput. Biol. Chem 2010.Google Scholar
 Haspel N, Ricklin D, Geisbrecht B, Lambris JD, Lydia EK: Electrostatic contributions drive the interaction between staphylococcus aureus protein EfbC and its complement target C3d. Protein Sci 2008, 17: 1894–1906. 10.1110/ps.036624.108PubMed CentralView ArticlePubMedGoogle Scholar
 Levitt M, Hirshberg M, Sharon R, Daggett V: Potential Energy Function and Parameters for simulations of the Molecular Dynamics of Proteins and Nucleic Acids in Solution. Computer Physics Communications 1995, 91: 215–231. 10.1016/00104655(95)00049LView ArticleGoogle Scholar
 Srinivasan J, Trevathan M, Beroza P, Case D: Application of a pairwise generalized Born model to proteins and nucleic acids: inclusion of salt effects. Theor. Chem. Acc 1999, 101: 426–434.View ArticleGoogle Scholar
 Jeffrey GA, Saenger W: Hydrogen bonding in biological structures. In SpringerVerlag. Germany; 1991.Google Scholar
 Tuv E, Borisov A, Torkokola K: Best subset feature selection for massive mixedtype problems. IDEAL 2006, 4224: 1048–1056.Google Scholar
 Friedman JH: Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 2000, 29: 1189–1232. 10.1214/aos/1013203451View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.