Implementation of 3D spatial indexing and compression in a large-scale molecular dynamics simulation database for rapid atomic contact detection
© Toofanny et al; licensee BioMed Central Ltd. 2011
Received: 19 April 2011
Accepted: 10 August 2011
Published: 10 August 2011
Molecular dynamics (MD) simulations offer the ability to observe the dynamics and interactions of both whole macromolecules and individual atoms as a function of time. Taken in context with experimental data, atomic interactions from simulation provide insight into the mechanics of protein folding, dynamics, and function. The calculation of atomic interactions or contacts from an MD trajectory is computationally demanding, and the work required grows rapidly (quadratically, for a naive all-pairs search) with the size of the simulation system. We describe the implementation of a spatial indexing algorithm in our multi-terabyte MD simulation database that significantly reduces the run-time required for discovery of contacts. The approach is applied to the Dynameomics project data. Spatial indexing, also known as spatial hashing, is a method that divides the simulation space into regularly sized bins and assigns an index to each bin. Since the calculation of contacts is widely employed in the simulation field, we also use it as the basis for testing compression of data tables. We investigate the effects of compressing the trajectory coordinate tables with different options for data and index compression within MS SQL SERVER 2008.
Our implementation of spatial indexing reduces the run-time for calculating contacts over a 1 nanosecond (ns) simulation window by between 14% and 90% (i.e., between 1.2 and 10.3 times faster). For a 'full' simulation trajectory (51 ns), spatial indexing reduces the calculation run-time by between 31% and 81% (between 1.4 and 5.3 times faster). Compression reduced table sizes but made no significant difference to the total execution time for neighbour discovery. The greatest compression (~36%) was achieved using page-level compression on both the data and indexes.
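The percentage reductions and the "times faster" figures quoted above are two views of the same measurement. As a quick check, and assuming the percentages denote the fractional reduction in run-time, a reduction r corresponds to a speed-up factor of 1/(1 - r). The function name below is illustrative only and is not part of the study.

```python
def speedup_factor(reduction_percent: float) -> float:
    """Convert a percentage reduction in run-time into a 'times faster' factor."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


# Percentages quoted in the text for the 1 ns window and the full trajectory.
for pct in (14, 90, 31, 81):
    print(f"{pct}% less run-time -> {speedup_factor(pct):.2f} times faster")
```

The rounded percentages reproduce the quoted factors only approximately (90% gives 10.0 rather than 10.3), presumably because the percentages themselves were rounded.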
The spatial indexing scheme significantly decreases the time taken to calculate atomic contacts and could be applied to other multidimensional neighbor discovery problems. The speed-up enables on-the-fly calculation and visualization of contacts and rapid cross-simulation analysis for knowledge discovery. Using page compression for the atomic coordinate tables and indexes saves ~36% of disk space without any significant increase in calculation time and should be considered for other non-transactional databases in MS SQL SERVER 2008.
Molecular dynamics (MD) simulations are routinely used to study the dynamic and structural properties of proteins and other macromolecules. MD simulations provide atomic-level resolution of a protein and its surrounding solvent environment as a function of time. There are no experimental techniques that can provide this level of detail. The direct results of an MD simulation are the coordinates of all atoms as a function of simulation time. Simulation time is divided into discrete time points or frames (akin to movie frames) that represent the coordinates for the entire system at that precise time. The assembled coordinate 'trajectories' (i.e., all frames) can be analysed for various properties and visualized to produce movies (examples of which can be found at http://www.dynameomics.org).
Nonbonded interactions within a protein are critical to its thermodynamic behaviour, contributing to packing and electrostatic energies reflected in the enthalpy. Such nonbonded interactions include but are not limited to hydrogen bonds, salt bridges, and hydrophobic contacts. Fluctuations in these nonbonded contacts as a function of time dictate dynamic behaviour and the conformations accessible to the protein. Dynamics are crucial for our understanding of protein function, folding and misfolding [2, 3].
We have recently completed a large-scale project, Dynameomics, in which we simulated the native states and unfolding pathways of representatives of essentially all autonomous protein fold families. These fold families, or metafolds, were chosen based on a consensus between the SCOP, CATH and DALI domain dictionaries, which we call a consensus domain dictionary (CDD) [5, 6]. Our recent release set contains 807 metafolds, representing 95% of the known autonomous domains in the Protein Data Bank (PDB). The Dynameomics database represents the largest collection of protein simulations in the world and contains about 10^4 times as many structures as the PDB.
The coordinates of the MD simulations and our set of standard analyses have been loaded into a relational database. This Dynameomics database is implemented using Microsoft SQL Server with the Windows Server operating system (described in more detail elsewhere). The Dynameomics protocol includes one native state simulation and at least 5 thermal unfolding simulations, which can be used to characterize the unfolding process of the domains. To explore the dynamics and folding in these simulations we often calculate the nonbonded contacts for each frame of the simulation. This problem has been well studied and is also known as the nearest neighbor search problem. The calculation is computationally expensive because the naïve approach is to test all possible pairs of atoms in the system. The number of protein atoms or amino acids is often used as a proxy for the overall simulation size. The average number of protein atoms in our Dynameomics set of simulations is 2150, with the smallest system consisting of 494 protein atoms and the largest of 6584 protein atoms. As all of the atoms in our simulations are in motion, all pairs of atoms need to be re-evaluated for each frame of the simulation, so in the case of a 51 ns native state simulation sampled at 1 picosecond (ps) resolution, we have 51,000 frames of atom pairs to evaluate. Calculating the nonbonded contacts without any acceleration method is not practical for a large number of simulations, as in a project like Dynameomics.
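To give a sense of scale, the short calculation below counts the atom pairs that a naive all-pairs test would evaluate, N(N - 1)/2 per frame, for the smallest, average and largest systems quoted above. It is an illustration only: it counts every protein atom, ignores solvent, and a heavy-atom-only query would test fewer pairs. The helper name is hypothetical.

```python
def pair_count(n_atoms: int) -> int:
    """Number of unordered atom pairs tested by a naive all-pairs search."""
    return n_atoms * (n_atoms - 1) // 2


FRAMES = 51_000  # a 51 ns trajectory sampled every 1 ps

for label, n_atoms in (("smallest", 494), ("average", 2150), ("largest", 6584)):
    per_frame = pair_count(n_atoms)
    print(f"{label:8s} {n_atoms:5d} atoms: {per_frame:>12,} pairs per frame, "
          f"{per_frame * FRAMES:,} pairs per trajectory")
```

Because the pair count grows with the square of the number of atoms, the largest system requires roughly 180 times as many distance tests per frame as the smallest.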
Spatial indexing overview
Spatial indexing is commonly used by programmers of 3D video games to detect collisions between objects, though the methods date back further in molecular simulation [10, 11], and other approaches similar in spirit have been described. The basic approach, based on the cell index method, is as follows: to accelerate the detection of near-neighbour objects in 3D space, the space is split into small, relatively uniform 3D bins. Each bin is given an index, and the objects in the system are sorted into the indexed bins based on their 3D coordinates. Neighboring objects can then be detected by performing a distance calculation on all pairs of objects in the same or immediately adjacent bins. A number of other algorithms could be used to speed up the discovery of nearest neighbors, including B-trees, kd-trees, Z-order curves, and Verlet neighbor lists. However, we decided to implement the cell-index-like method because we already have experience implementing it in our in-house MD simulation software and have found it to be very efficient.
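The binning idea described above can be sketched in a few lines of Python. This is a minimal, non-periodic illustration written for this text, not the database implementation: the function name is hypothetical, the bin edge is assumed to equal the cutoff distance, and periodic boundaries are ignored.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def find_contacts(coords, cutoff):
    """Return index pairs (i, j), i < j, whose distance is <= cutoff.

    Bins have edge length `cutoff`, so any pair within the cutoff lies in
    the same bin or in two immediately adjacent bins.
    """
    # Sort every object into its bin, keyed by integer bin coordinates.
    bins = defaultdict(list)
    for i, (x, y, z) in enumerate(coords):
        bins[(floor(x / cutoff), floor(y / cutoff), floor(z / cutoff))].append(i)

    contacts = set()
    for (bx, by, bz), members in bins.items():
        # Visit the bin itself and its 26 neighbours.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbours = bins.get((bx + dx, by + dy, bz + dz), ())
            for i in members:
                for j in neighbours:
                    if i < j and dist(coords[i], coords[j]) <= cutoff:
                        contacts.add((i, j))
    return contacts


# Three points on a line: only the first two are within 5.4 units.
print(find_contacts([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)], 5.4))
```

Distance calculations are performed only for objects that share a bin or sit in neighbouring bins, so the work per object depends on the local density rather than on the total number of objects.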
SQL Server 2008 supports two types of compression, which can be applied separately to the data and indices associated with a table (row and page level compression is only available in MS SQL Server 2008). Row compression is a more efficient representation of row data; it stores fixed-length columns in a manner similar to variable-length columns, where repeated bytes are compressed. For coordinate columns, which are a set of five 32-bit fixed-length columns, the storage savings from row compression are small. Page compression, which is built on top of row compression, stores repeating values in a single structure for each page and then references that structure. This can result in significant savings, as coordinate tables contain numerous columns with repeated data, such as atom number, that are used in relational joins to retrieve atom information such as name, mass, and element.
Results and Discussion
11 representative proteins - number of residues and number of atoms
# protein atoms
Domain of Adr1 DBD from S. cerevisiae
Thymus and activation-regulated chemokine
Domain of transforming growth factor-beta 2 (TGF- B2)
Horse plasma gelsolin
Domain of serum transferrin
Human growth hormone
Monomer of glucose-1-phosphate thymidylyltransferase
Alginate Lyase A1-III
The significant decrease in execution time for identification of nonbonded contacts has three important implications. First, contact calculations are substantially more tractable for very large proteins in Dynameomics. For one of the largest fold representatives (1ehe) in our Dynameomics set, which contains 399 residues (plus a heme moiety), the average execution time dropped from 18 minutes and 10 seconds to just under 1 minute and 45 seconds. Second, the query execution time is fast enough to enable us to perform large-scale multi-simulation analyses. Dynameomics is fundamentally about knowledge discovery across a large number of protein systems. For example, a key query for Dynameomics is to identify all of the types of hydrophobic contacts across the native state simulations for all of the 807 metafold representatives in order to identify patterns. Such an all-encompassing search is no longer impractical, as contact queries across multiple servers can be executed to return the contact set rapidly. Third, as the calculation can be run in near real time, contact queries can be performed on the fly, with the result set streamed through analyses and regenerated when required rather than stored permanently. The disk space needed to save the contact results may exceed the size of the original coordinate data from which they were derived. Hence, we would need to more than double the size of our existing database configuration if we were to store the results of contact queries for all simulations. Furthermore, the ability to run ad-hoc, on-the-fly analyses is at the heart of our exploratory mining efforts for Dynameomics. Our exploratory visualization tool for extremely large datasets, dubbed DIVE (Data Intensive Visualization Engine), can connect to our SQL database and rapidly visualize, and act upon, millions of data points in many dimensions, such as the results of the nonbonded contact queries.
These data suggest that for other static (i.e., non-transactional) databases implemented in MS SQL Server 2008, compression may offer substantial disk savings. Furthermore, the framework of spatial indexing in a SQL database to speed up the discovery of near neighbours can be applied to other neighbor discovery problems, such as calculation of the distances between galaxies and/or planets in the field of astrophysics. The spatial indexing framework can also be applied to problems in which the space is not limited to three dimensions but has fixed boundaries in each dimension, and it could be used to cluster high-dimensional data sets.
The spatial indexing implementation presented herein for our multi-terabyte MD simulation database decreases neighbour discovery and interaction query execution times by up to 90%. While the speed-up for small proteins was less pronounced, the implementation was suitable for all sizes of simulation system: it introduced no overhead for small systems and significantly improved performance for larger systems. In addition, this work shows that none of the combinations of page and row compression that we tested across the data and indexes has an appreciable effect on the run-time of the heavy-atom contact query. The page/page compression set for data and indexes yielded a 36% disk savings for full trajectories over non-compressed tables. This represents a substantial saving for large data sets.
Details of how we selected the 807 metafolds for simulation in our Dynameomics project can be found elsewhere ([4, 5]). The MD simulations were performed using in lucem molecular mechanics (il mm) following the Dynameomics protocol described by Beck et al. Each of the metafolds had at least one native-state simulation performed at 298 K for at least 51 ns of simulation time, along with 5-8 unfolding simulations at 498 K, with two of these simulations being at least 51 ns long. Structures were saved every 0.2 ps for the shorter simulations and every 1 ps for the longer simulations. Coordinates and analyses from the simulations were loaded into our Dynameomics database (the development and technical details of the database are discussed in more depth elsewhere).
When a simulation is loaded into the database, it is assigned a unique identifier and a specific location, i.e. server and database. Three tables were created in the assigned database to hold the underlying data for the simulation: a trajectory coordinate table, a box table, and a bins table. Each table was named by the simulation identifier; for example, the tables for the simulation with identifier 37 would be "Coord_37," "Box_37," and "Bins_37." The coordinate table contained columns for each of the three-dimensional coordinates, atom number, step, structure identifier, and instance (which is used to identify monomers in a multimer system). The box table had columns for the x, y, and z dimensions of the periodic box at each time point. The bins table recorded the set of adjacent bins for each primary bin in the box. All three tables had clustered primary keys and constraints, and the coordinate table also had a secondary covering index.
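As an illustration of what a bins table of this kind might contain, the sketch below generates (bin, adjacent bin) rows for a regular grid. The function names, the flattened integer bin index, and the optional periodic wrapping are all assumptions made for this example; the actual layout of the Bins tables is not specified here.

```python
from itertools import product


def bin_index(ix, iy, iz, nx, ny):
    """Flatten 3D bin coordinates into a single integer index."""
    return ix + nx * (iy + ny * iz)


def adjacent_bin_rows(nx, ny, nz, periodic=False):
    """Return sorted (bin, adjacent_bin) rows; every bin is adjacent to itself."""
    rows = set()
    for ix, iy, iz in product(range(nx), range(ny), range(nz)):
        here = bin_index(ix, iy, iz, nx, ny)
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            jx, jy, jz = ix + dx, iy + dy, iz + dz
            if periodic:
                # Wrap around the box edges (assumed treatment of periodicity).
                jx, jy, jz = jx % nx, jy % ny, jz % nz
            elif not (0 <= jx < nx and 0 <= jy < ny and 0 <= jz < nz):
                continue  # neighbour lies outside the grid
            rows.add((here, bin_index(jx, jy, jz, nx, ny)))
    return sorted(rows)


rows = adjacent_bin_rows(3, 3, 3)
print(len(rows), "rows; the central bin has",
      sum(1 for b, _ in rows if b == 13), "adjacent bins")
```

Because the rows depend only on the grid dimensions, a table like this can be computed once and reused for every frame that shares the same binning.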
We selected 11 metafolds to represent the range in sequence size that our Dynameomics project covers, from the smallest, the ADR1 DNA-binding domain from Saccharomyces cerevisiae (2adr, 29 residues and a zinc ion), to one of our largest, cytochrome P450 (1ehe, 399 residues and heme). Figure 2 shows the metafolds selected. In the tests conducted in this study we chose to look at the 51 ns native state (298 K) simulations for each of these proteins.
Implementation of spatial indexing in the database
There are three supported join types in SQL Server: Hash, Merge, and Loop. Normally queries are expressed using only the keyword JOIN, leaving the optimizer free to choose the join type when an execution plan for a query is prepared. Join types are described in detail elsewhere. The self-join of the coordinate table presented a unique difficulty because of the size of the coordinate table. The optimizer consistently chose a hash join, which requires an expensive build of a temporary hash structure. In contrast, the merge join type does not require the temporary structure, and because the data are ordered by the primary key, this approach is significantly faster.
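The difference between the two join strategies can be shown conceptually in Python. The sketch below is a textbook illustration on lists of (key, value) rows, not a description of SQL Server's internals: the hash join must first build a lookup structure over one input, whereas the merge join walks two inputs that are already sorted by key.

```python
def hash_join(left, right):
    """Join (key, value) rows by first building a hash table on `left`."""
    table = {}
    for key, value in left:  # build phase: extra memory for every left row
        table.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, ())]


def merge_join(left, right):
    """Join (key, value) rows that are already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs, then skip past them.
            out.extend((lk, lv, rv)
                       for _, lv in left[i:i_end]
                       for _, rv in right[j:j_end])
            i, j = i_end, j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
print(merge_join(left, right))
```

When the inputs are already ordered, as rows stored in clustered primary key order are, the merge join needs only a single pass over each input and no temporary structure.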
We optimized the structure of the query with the use of two right associative joins to cause early evaluation of the coordinate and atom ID table joins. We also pushed predicates directly into the join clauses. However, despite these optimizations a great deal of time was spent calculating distances for atoms that are outside the 5.4 Å distance of interest. These additional calculations added a significant performance burden, making it impractical to run this query over more than a handful of trajectories.
With the bins table in place, the contact query presented earlier can be modified slightly to filter coordinates considered using the bin column in the coordinate table. The modification is shown in bold (Figure 7). This simple join allows the query optimizer to quickly remove distance calculations based on a comparison of integer columns instead of projecting and transforming x, y, z from each half of the join. In this way, the bins table acted as a highly optimized spatial index.
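The shape of such a bin-filtered self-join can be sketched with Python's built-in sqlite3 module. This is a toy example under stated assumptions, not the query used in this work: it runs on SQLite rather than SQL Server, the table and column names (coord, bins, bin, adjacent_bin) are hypothetical, and the four atoms lie on a line so that the bin index is simply floor(x / cutoff).

```python
import sqlite3

CUTOFF = 5.4  # contact distance in angstroms, as quoted in the text

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE coord (step INTEGER, atom INTEGER, bin INTEGER,
                        x REAL, y REAL, z REAL,
                        PRIMARY KEY (step, atom));
    CREATE TABLE bins (bin INTEGER, adjacent_bin INTEGER,
                       PRIMARY KEY (bin, adjacent_bin));
""")

# Four atoms on a line; bins are CUTOFF wide, so bin = floor(x / CUTOFF).
atoms = [(1, 0.0), (2, 3.0), (3, 8.0), (4, 20.0)]
conn.executemany(
    "INSERT INTO coord VALUES (0, ?, ?, ?, 0.0, 0.0)",
    [(atom, int(x // CUTOFF), x) for atom, x in atoms],
)

# Each bin is adjacent to itself and to its immediate neighbours (bins 0..3).
conn.executemany(
    "INSERT INTO bins VALUES (?, ?)",
    [(b, b + d) for b in range(4) for d in (-1, 0, 1) if 0 <= b + d < 4],
)

CONTACT_QUERY = """
    SELECT a.atom, b.atom
    FROM coord AS a
    JOIN bins  AS n ON n.bin = a.bin
    JOIN coord AS b ON b.step = a.step
                   AND b.bin = n.adjacent_bin   -- cheap integer comparison
                   AND b.atom > a.atom
    WHERE (a.x - b.x) * (a.x - b.x)
        + (a.y - b.y) * (a.y - b.y)
        + (a.z - b.z) * (a.z - b.z) <= :cutoff * :cutoff
    ORDER BY a.atom, b.atom
"""
contacts = conn.execute(CONTACT_QUERY, {"cutoff": CUTOFF}).fetchall()
print(contacts)
```

In this toy data set atom 4 sits in a bin that is not adjacent to any occupied bin, so it is never paired with another atom and no distance is computed for it.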
Table and index compression in the database
To investigate the effect of compression on database queries, we returned to the contacts query introduced earlier in this section, as it is a commonly used and computationally expensive query in trajectory analysis. We reviewed performance data collected against all combinations of compression options across our sample set of 11 protein simulations. We also considered non-compressed and fully page-compressed contact queries over the first 1 nanosecond that did not use the spatial indexing optimization.
Database and System setup
Detailed hardware and software configuration information
Dual Intel Xeon X5650s (x64 Hex Core)
H700 Integrated RAID SAS Disk Controller
136 GB on two 15K RPM 150GB SAS disks, RAID 1 (Mirrored)
7,450 GB on six 7200 RPM 2TB SAS disks, RAID 0 (Striped)
Windows Server 2008 R2 Enterprise x64
SQL Server 2008 R2 Enterprise x64
Enabled for all CPUs
Limited to 40,960 MB (8GB for OS)
Sophos Endpoint Security and Control, version 9
One database called hash3d-700 was created on each server and populated with a copy of the coordinate trajectory tables and dimension tables from our primary data warehouse [7, 21]. The base coordinate tables were then copied to additional tables, with a suffix added to indicate the data and index compression settings. After all coordinate tables were created and populated, identical primary keys, constraints and indexes were applied. Tables were then compressed using ALTER TABLE statements. A script was run on all the coordinate table compression combinations to create contact tables. The size of each hash3d-700 database was then adjusted upwards to 1.2 TB and the SQL Server process was shut down. Finally, the data and system partitions were defragmented with defrag.exe to clean up file system fragmentation caused by auto-growth during loading.
Queries were run in SQL Server Management Studio running on a remote machine with a connection to the test database server. Queries were executed with SET STATISTICS IO ON and SET STATISTICS TIME ON to capture logical and physical read statistics. To control for performance gains caused by data and/or query plan caching, and for background write operations from result tables, a series of three system statements was executed prior to running the test query (Additional file 4, Figure S1). The CHECKPOINT statement ensures that any dirty pages (such as result rows written out by the previous query) are written to disk. The FREESYSTEMCACHE command eliminates any stored query or procedure plans. The DROPCLEANBUFFERS command flushes out the current cache, leaving it effectively cold, as though SQL Server had just started. During the collection of run-time data, access to both servers was restricted and only the query of interest was permitted to run.
We calculated the pairs of heavy-atom contacts for the first nanosecond of each simulation (1000 frames) and compared the execution times with and without spatial indexing. Queries were written in SQL and executed in SQL Server Management Studio as described in the above section. Heavy-atom contact calculations were performed in triplicate for each simulation, with the system cache cleared between runs to obtain accurate performance statistics. Statistics were calculated using a two-sample, two-sided t-test for unequal variances.
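A two-sample t-test for unequal variances is commonly known as Welch's t-test. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The run-times are hypothetical values invented for the example and are not data from this study; the function name is likewise illustrative.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical triplicate run-times in seconds (NOT measurements from the study).
without_index = [120.0, 118.0, 125.0]
with_index = [30.0, 31.0, 29.5]

t, df = welch_t(without_index, with_index)
print(f"t = {t:.2f}, degrees of freedom = {df:.2f}")
```

Converting t and the degrees of freedom into a two-sided p-value requires the cumulative distribution function of Student's t distribution, which is not in the standard library; if SciPy is available, scipy.stats.ttest_ind with equal_var=False performs the whole test.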
We investigated 9 sets of compression options on both data and indices for each coordinate table for the 11 simulations in our test set. We recorded the extent of compression of each set of compression options compared with the non-compressed coordinate tables. We then ran an initial test of performance by investigating the execution time and disk input and output operations of the heavy-atom contacts query over the first nanosecond of the simulation. Subsequently, we examined the execution time of the heavy-atom contacts query over the full 51 ns (51,000 frames) native state trajectory for each of the proteins in our test set.
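The nine sets are most naturally read as every pairing of three settings (none, row, page) for the data with the same three settings for the indexes. Under that assumption, which is ours rather than stated explicitly above, the combinations can be enumerated as follows; the "data/index" label format is illustrative.

```python
from itertools import product

LEVELS = ("none", "row", "page")  # assumed compression settings per object

# One data/index label per configuration: 3 x 3 = 9 combinations.
compression_sets = [f"{data}/{index}" for data, index in product(LEVELS, repeat=2)]

print(len(compression_sets), "sets:", ", ".join(compression_sets))
```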
We are grateful for financial support provided by the National Institutes of Health (GM50789). D.A.C.B's involvement was supported in part by the University of Washington's eScience Institute. Dynameomics simulations were performed using computer time through the DOE Office of Biological Research as provided by the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231. We are also grateful for support from Microsoft for development of our database.
1. Karplus M, Kuriyan J: Molecular dynamics and protein function. Proc Natl Acad Sci USA 2005, 102(19):6679–6685. doi:10.1073/pnas.0408930102
2. Fersht AR, Daggett V: Protein folding and unfolding at atomic resolution. Cell 2002, 108(4):573–582. doi:10.1016/S0092-8674(02)00620-7
3. Chiti F, Dobson CM: Protein misfolding, functional amyloid, and human disease. Annu Rev Biochem 2006, 75:333–366. doi:10.1146/annurev.biochem.75.101304.123901
4. van der Kamp MW, Schaeffer RD, Jonsson AL, Scouras AD, Simms AM, Toofanny RD, Benson NC, Anderson PC, Merkley ED, Rysavy S, et al.: Dynameomics: a comprehensive database of protein dynamics. Structure 2010, 18(4):423–435. doi:10.1016/j.str.2010.01.012
5. Schaeffer RD, Jonsson AL, Simms AM, Daggett V: Generation of a consensus protein domain dictionary. Bioinformatics 2011, 27(1):46–54. doi:10.1093/bioinformatics/btq625
6. Day R, Beck DA, Armen RS, Daggett V: A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Science 2003, 12(10):2150–2160.
7. Simms AM, Toofanny RD, Kehl C, Benson NC, Daggett V: Dynameomics: design of a computational lab workflow and scientific data repository for protein simulations. Protein Engineering Design & Selection 2008, 21(6):369–377. doi:10.1093/protein/gzn012
8. Clarkson K: Nearest-neighbor searching and metric space dimensions. In Nearest-Neighbor Methods for Learning and Visions: Theory and Practice. Cambridge, MA: MIT Press; 2005.
9. Lefebvre S, Hoppe H: Perfect Spatial Hashing. ACM Transactions on Graphics 2006, 25(3):579–588. doi:10.1145/1141911.1141926
10. Hockney RW, Eastwood JW: Computer Simulation Using Particles. New York: McGraw-Hill; 1981.
11. Allen MP, Tildesley DJ: Computer Simulation of Liquids. Oxford: Oxford University Press; 1987.
12. Yip V, Elber R: Calculations of a list of neighbors in Molecular Dynamics simulations. Journal of Computational Chemistry 1989, 10(7):921–927. doi:10.1002/jcc.540100709
13. Beck DAC, Alonso DOV, Daggett V: in lucem Molecular Mechanics (il mm). University of Washington, Seattle; 2000.
14. Beck DAC, Daggett V: Methods for molecular dynamics simulations of protein folding/unfolding in solution. Methods in Enzymology 2004, 34(1):112–120. doi:10.1016/j.ymeth.2004.03.008
15. Beck DA, Jonsson AL, Schaeffer RD, Scott KA, Day R, Toofanny RD, Alonso DO, Daggett V: Dynameomics: mass annotation of protein dynamics and unfolding in water by high-throughput atomistic molecular dynamics simulations. Protein Engineering Design & Selection 2008, 21(6):353–368. doi:10.1093/protein/gzn011
16. Bromley D, Rysavy S, Beck DA, Daggett V: DIVE: A Data Intensive Visualization Engine. Microsoft Research eScience Workshop 2010.
17. Bowers PM, Schaufler LE, Klevit RE: A folding transition and novel zinc finger accessory domain in the transcription factor ADR1. Nat Struct Biol 1999, 6(5):478–485. doi:10.1038/8283
18. Shimizu H, Park S, Lee D, Shoun H, Shiro Y: Crystal structures of cytochrome P450nor and its mutants (Ser286-->Val, Thr) in the ferric resting state at cryogenic temperature: a comparative analysis with monooxygenase cytochrome P450s. J Inorg Biochem 2000, 81(3):191–205. doi:10.1016/S0162-0134(00)00103-3
19. Fritchey G, Dam S: SQL Server 2008 Query Performance Tuning Distilled. Apress, New York; 2009.
20. David MM: Advanced ANSI SQL data modeling and structure processing. Boston: Artech House; 1999.
21. Simms AM, Daggett V: Protein simulation data in the relational model. J of Supercomp 2011, in press.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.