Nuclear Magnetic Resonance (NMR) chemical shift prediction has developed into a valuable tool for computational structural biology and biomolecular NMR spectrometry. While the high dependency of NMR chemical shifts on structural details renders their prediction a formidable task, it simultaneously makes them a very valuable source of structural information for applications like structure determination and optimization [1–4], and the scoring of protein-protein docking results [5–7]. Hence, a variety of fields profit from efficient and accurate chemical shift estimation from a three-dimensional model of the molecule under consideration – in docking, for instance, predicted shifts for the candidate complexes are compared to the experimentally observed ones as a measure of reconstruction error.
The development of novel NMR chemical shift prediction techniques is a challenging task. Previous approaches either focus on full quantum mechanical models (e.g. [8–11]) which are computationally very expensive, or settle for so called semi-classical approximations borrowed from classical physics [5, 12–14]. As a third option, prediction techniques can use statistical models based on semi-classical, structural, or sequential features of the proteins (e.g. PROSHIFT , ShiftX [14, 16], SPARTA , CamShift , BioShift , SPARTA+, ShiftX2 , or shAIC ). For high-throughput applications, the most successful approaches today offer good prediction accuracy with relatively low computational cost by combining semi-classical and statistical approaches. These techniques are known as hybrid methods.
Developing a new hybrid method, or extending an existing one, is a hard and complex task for which three questions have to be addressed: which features should be included into the model, which statistical technique should be employed, and which data set can be used.
The question of data set generation in particular is a very difficult one. The required information for creating such a data set is spread over several data bases, such as the Biological Magnetic Resonance Bank (BMRB)  and the Protein Data Bank (PDB)  and is stored in different, notoriously hard-to-parse, file formats. To make matters worse, real-life data sets often contain serious syntactical, semantical, and logical errors or inconsistencies. Consequently, a number of publications [21, 25–29] discuss the necessity of checking and correcting the given data, BMRB NMR data and PDB coordinates alike. Typical issues for creating NMR chemical shift prediction data sets are the completeness and quality of PDB files, the chemical re-referencing problem for NMR data, and the exclusion of homologs within the data set. For BMRB files, a number of approaches (e.g. [25–28]) have been developed to detect and correct assignment and referencing errors. Mainly due to these complications, most former approaches rely on hand-curated data sets created by the application of unstandardized sequences of restriction and correction methods.
Another challenge when training prediction models is the computation of the semi-classical terms, or the structural and sequential features to learn from. Computing these terms and molecular features correctly, reliably, and efficiently requires complex molecular data structures and algorithms. Further technical challenges are the computation and the choice of a statistical model.
In this work, we present an extensible automated framework called NightShift for data set generation and training of hybrid NMR chemical shift prediction methods. Most importantly, typical semi-classical terms for shift prediction are implemented and readily available. As of now, we include random coil contributions, aromatic ring current effects, electric field contributions, and hydrogen bonding effects. In addition, the feature set for the training of the statistical term encompasses sequential, structural (angles, surface, and density), force-field based, as well as experimental properties. All features are computed using our open source library BALL , and can be easily extended.
Due to its modular nature, the framework can employ data for both, protein structures and chemical shifts, from a variety of sources. As long as the input is available in the form of one of several recognized standard file formats (e.g., PDB for the proteins and NMRStar for the shift data), NightShift can easily train and evaluate models on it. As demonstrated in this paper, this freedom can, e.g., be helpful in addressing some of the current controversies in shift prediction. For instance, the user can freely decide on a shift reference correction method of his choice, or validate models trained on non-reference corrected data sets on rereferenced ones, or study the difference of models trained on X-Ray-derived protein structures to those based on NMR-derived ones.
To demonstrate that the data collected by the framework is indeed of use for NMR shift prediction, we train and evaluate a simple hybrid prediction model. The whole training and evaluation process is completeley automated and does not require human intervention. Based on recent research [16, 21], we choose a random forest model for the statistical contribution of this proof-of-concept predictor which is known for its prediction quality and efficiency, and in our experiments has demonstrated to yield very accurate and stable results. In general, however, the pipeline is model-agnostic and can be used with any regression technique implemented in R .