Single-molecule dataset (SMD): a generalized storage format for raw and processed single-molecule data
BMC Bioinformatics volume 16, Article number: 3 (2015)
Single-molecule techniques have emerged as incisive approaches for addressing a wide range of questions arising in contemporary biological research [Trends Biochem Sci 38:30–37, 2013; Nat Rev Genet 14:9–22, 2013; Curr Opin Struct Biol 2014, 28C:112–121; Annu Rev Biophys 43:19–39, 2014]. The analysis and interpretation of raw single-molecule data benefit greatly from the ongoing development of sophisticated statistical analysis tools that enable accurate inference at the low signal-to-noise ratios frequently associated with these measurements. While a number of groups have released analysis toolkits as open source software [J Phys Chem B 114:5386–5403, 2010; Biophys J 79:1915–1927, 2000; Biophys J 91:1941–1951, 2006; Biophys J 79:1928–1944, 2000; Biophys J 86:4015–4029, 2004; Biophys J 97:3196–3205, 2009; PLoS One 7:e30024, 2012; BMC Bioinformatics 11(8):S2, 2010; Biophys J 106:1327–1337, 2014; Proc Int Conf Mach Learn 28:361–369, 2013], it remains difficult to compare analyses of experiments performed in different labs due to a lack of standardization.
Here we propose a standardized single-molecule dataset (SMD) file format. SMD is designed to accommodate a wide variety of computer programming languages, single-molecule techniques, and analysis strategies. To facilitate adoption of this format we have made two existing data analysis packages that are used for single-molecule analysis compatible with this format.
Adoption of a common, standard data file format for sharing raw single-molecule data and analysis outcomes is a critical step for the emerging and powerful single-molecule field, which will benefit both sophisticated users and non-specialists by allowing standardized, transparent, and reproducible analysis practices.
Single-molecule techniques have proliferated over the past decade [1-4]. Despite the power of these techniques and their widespread use, critical assessment of single-molecule data remains challenging. While there are multiple reasons for this, principal among these are the inherent noise and stochasticity associated with single-molecule events, which contribute substantially to the analysis challenge. To help manage similarly complex data sets generated from a number of techniques used in modern biological research, other fields have adopted standard data file formats, repositories, and analysis approaches. Examples include the PDB file format for structural data; the RCSB PDB repository of biomolecular structures; the NIH GenBank, DDBJ, and EMBL ENA repositories of gene and genome sequences; the NCBI BLAST and Ensembl sequence alignment and analysis tools; and the CNSsolve biomolecular structure determination tool [5-14]. Standardization has been a key part of the development and advancement of these resources and techniques, facilitating data sharing and dissemination. In addition, the transparency of these formats, repositories, and tools encourages critical assessment of data. Individually the effect of these changes is difficult to assess, but cumulatively they contribute to increased reproducibility and reliability of measurements and, as a result, to the growth and widespread adoption of these techniques.
These examples represent important successes that have arisen naturally. However, several institutions and scientific leaders have recently begun to insist on greater transparency in the dissemination and treatment of all types of scientific data [15,16]. While there are many reasons for this desire and need, a number of well-documented instances within the drug discovery industry in which the reproducibility of scientific results has been questioned [17-20] have raised awareness that a lack of easy access to raw data (arising from many sources) and a lack of tools for the primary analysis of the data can undermine clear communication of scientific results and can contribute to erroneous conclusions. Such high-profile problems cannot be attributed to any single failing, but a contributing cause is likely a current lack of standardization and control across the numerous measurement techniques that are combined to support these multidisciplinary development efforts [21,22].
Currently there is no standardization in place to unify the common aspects of most single-molecule data sets and to facilitate the use of the sophisticated analysis approaches that are continually being developed [23-32]. We propose the single-molecule dataset (SMD) file structure as a general data format for storing and disseminating single-molecule data. Moreover, we take steps to facilitate this transition by making two previously established data-analysis packages created in independent labs compatible with this format.
There are many commonalities in how single-molecule data are collected, stored, and analyzed. Figure 1A outlines three unifying relationships that form the basis of the SMD hierarchy. Most single-molecule datasets take the form of time series data (i.e., traces) that are acquired simultaneously from one or more channels during an experiment. While this is not always the rawest form of the data (e.g., a trace can be extracted from a movie recorded using a microscope that can simultaneously monitor many individual molecules), the single-molecule trace unifies many different techniques. At the highest level, a set of single-molecule traces (denoted as black rectangles in Figure 1A, top) is unified by the particular experiment that was used to generate it (denoted as a purple rectangle in Figure 1A, top). Finally, each trace can have associated experimental information and quantities derived from the analysis of the raw single-molecule data (e.g., inferred kinetic and thermodynamic parameters from model fitting; denoted as an orange rectangle in Figure 1A, bottom). The aim of SMD is to encapsulate this hierarchy in a file structure that is independent of any particular programming language, data acquisition platform, or data analysis tool and that is widely compatible with distinct techniques and analysis strategies.
Results and discussion
The SMD format aims to strike a balance between defining enough structure to facilitate interoperability of software packages and exchange of data between groups, and providing enough flexibility to accommodate data associated with different experimental techniques and analysis use cases. The most important assumption we make is that the dataset holds traces with a fixed set of channels (e.g., raw measurements, post-processed time series, inferred kinetic trajectories, etc.) that are annotated by some set of attributes (e.g., pre-processing settings, fitted model parameters, etc.). The attributes may be quite specific to the type of experiment and analysis performed, but the channel values themselves should in general be suitable for visualization and analysis with different software packages. Figure 1B outlines how the three components of SMD are structured in the JSON notation (the top level is depicted in purple, raw data in black, and trace-specific parameters in orange). Each trace contains four fields. The values field stores the trace data, with each data type identified by a descriptive tag. The index field contains a list of row labels for the trace (typically measurement acquisition times). Any other trace-specific annotations (e.g., pre-processing settings, fitted model parameters, etc.) are placed in the attr field. Finally, the id field stores a 32-digit hexadecimal number generated by running the MD5 algorithm on the data for each trace. The list of traces is stored in the data field of an outer, top-level structure. This structure also has a dataset-specific id field (generated by running the MD5 algorithm on the entire data structure), an attr field that holds top-level annotations or summary statistics that apply to the dataset as a whole (e.g., experimental conditions, time and date of acquisition, averaged model parameters, etc.), and a desc field that contains a string describing the dataset.
Additionally, the dataset-specific types field specifies the data type for each instance of data stored in each set of values. A full description of the SMD specification is provided in Additional file 1.
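The field layout described above can be illustrated with a short Python sketch. This is not the reference implementation: the helper names (md5_hex, make_trace, make_dataset), the channel names, the layout of the types field, and the choice to hash a canonical JSON serialization are all assumptions made for illustration. The authoritative definitions are in the SMD specification (Additional file 1).

```python
import hashlib
import json


def md5_hex(obj):
    """Return the 32-character hexadecimal MD5 digest of a JSON-serializable object.

    Assumption: the object is serialized as compact JSON with sorted keys
    before hashing. The SMD specification may define this step differently.
    """
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def make_trace(index, values, attr=None):
    """Build one trace with the four fields named in the text."""
    trace = {
        "index": list(index),  # row labels, typically acquisition times
        "values": {name: list(series) for name, series in values.items()},
        "attr": dict(attr or {}),  # trace-specific annotations
    }
    # Hash only the trace data (index and values), not the annotations.
    trace["id"] = md5_hex({"index": trace["index"], "values": trace["values"]})
    return trace


def make_dataset(traces, desc, attr=None, types=None):
    """Wrap a list of traces in the top-level structure."""
    dataset = {
        "desc": desc,  # free-text description of the dataset
        "attr": dict(attr or {}),  # dataset-wide annotations
        "types": dict(types or {}),  # data type of each stored quantity
        "data": list(traces),
    }
    # Hash the whole structure before the id field itself is attached.
    dataset["id"] = md5_hex(dataset)
    return dataset


# Hypothetical two-channel trace with three time points.
trace = make_trace(
    index=[0.0, 0.1, 0.2],
    values={"donor": [510.0, 498.0, 120.0], "acceptor": [95.0, 102.0, 480.0]},
    attr={"background_subtracted": True},
)
dataset = make_dataset(
    [trace],
    desc="Illustrative two-channel dataset",
    attr={"temperature_C": 22},
    types={"index": "float", "values": {"donor": "float", "acceptor": "float"}},
)
print(json.dumps(dataset, indent=2))
```

Because the structure uses only JSON objects, arrays, strings, and numbers, the same file can be read without custom parsing in any language that has a JSON library.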
To facilitate the design and adoption of SMD we made the ebFRET [31,32] and SMART [29] single-molecule data analysis packages and visualization tools compatible with the SMD file format. We note here that ebFRET is a descendant of the previously released vbFRET [28,30] data analysis package. We also provide a number of tools for the basic support and validation of SMD files as both Matlab™ and Python packages. Full documentation of SMD and these tools is available at https://smdata.github.io.
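As a rough idea of what basic validation of an SMD file might involve, the following Python sketch checks that the fields named above are present. It does not reproduce the API of the released Matlab™ or Python tools. The function validate_smd is hypothetical, and the rule that every channel must have one value per index row is an assumption inferred from the description of index as a list of row labels.

```python
import json
import re

TRACE_FIELDS = {"id", "index", "values", "attr"}
DATASET_FIELDS = {"id", "desc", "attr", "types", "data"}
MD5_PATTERN = re.compile(r"^[0-9a-fA-F]{32}$")


def validate_smd(dataset):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper for illustration only, not part of the SMD tools.
    """
    problems = []
    missing = DATASET_FIELDS - set(dataset)
    if missing:
        problems.append(f"dataset is missing fields: {sorted(missing)}")
    if not MD5_PATTERN.match(str(dataset.get("id", ""))):
        problems.append("dataset id is not a 32-digit hexadecimal string")
    for n, trace in enumerate(dataset.get("data", [])):
        missing = TRACE_FIELDS - set(trace)
        if missing:
            problems.append(f"trace {n} is missing fields: {sorted(missing)}")
            continue
        if not MD5_PATTERN.match(str(trace["id"])):
            problems.append(f"trace {n} id is not a 32-digit hexadecimal string")
        # Assumed rule: each channel holds one value per index row.
        n_rows = len(trace["index"])
        for channel, series in trace["values"].items():
            if len(series) != n_rows:
                problems.append(
                    f"trace {n}, channel {channel!r}: "
                    f"{len(series)} values for {n_rows} index rows"
                )
    return problems


# Made-up file contents: the single channel is one value short of the index.
text = """
{"id": "0123456789abcdef0123456789abcdef", "desc": "demo", "attr": {},
 "types": {}, "data": [
   {"id": "fedcba9876543210fedcba9876543210", "index": [0, 1, 2],
    "values": {"fret": [0.2, 0.8]}, "attr": {}}]}
"""
problems = validate_smd(json.loads(text))
for problem in problems:
    print(problem)
```

Returning a list of messages rather than raising on the first failure lets a user see every structural problem in a file at once.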
The collaboration that resulted in SMD enabled many details that are important for ensuring generality to be implemented. The ebFRET and SMART data analysis packages were developed independently from one another and as a result have significantly different functionalities and work flows. The ability of SMD to easily accommodate these packages with multiple graphical interfaces and distinct outputs provides a strong indication that SMD will be able to accommodate the needs of many researchers.
Adoption of SMD or, as needed, a different format that encapsulates generalities not anticipated at this time, is an important step toward realizing the full potential of single-molecule measurements by and for a broad scientific community. Although it will require some discipline for researchers to abide by a common set of standards, the potential long-term benefits are hard to overstate. Standardization will help facilitate the transfer of information among different labs by ensuring that a minimal structure and set of information are present. In turn, this information sharing will facilitate further critical assessment (e.g., data quality, error assessment, and reproducibility) and reanalysis of single-molecule datasets, important steps in extracting the most from complex but information-rich single-molecule data. Moreover, adoption of a common data standard could help facilitate the creation of a repository for single-molecule data (analogous to the RCSB PDB repository of biomolecular structures), which would enable a high degree of transparency and would ensure that data obtained now yield further insights in years to come. We are hopeful that the flexibility of SMD can easily accommodate the needs of current researchers and that it will enable researchers to reap the benefits that accompany widely adopted standardization.
Availability and requirements
Project name: Single-molecule dataset (SMD)
Project home page: https://smdata.github.io
Operating system: Platform independent
Programming languages: Support provided for Matlab™ and Python, but SMD is not tied to any particular programming language.
Other requirements: none
Licenses: Creative Commons
Any restrictions to use by non-academics: none
1. Joo C, Fareh M, Kim VN: Bringing single-molecule spectroscopy to macromolecular protein complexes. Trends Biochem Sci 2013, 38:30–37.
2. Dulin D, Lipfert J, Moolman MC, Dekker NH: Studying genomic processes at the single-molecule level: introducing the tools and applications. Nat Rev Genet 2013, 14:9–22.
3. Coltharp C, Yang X, Xiao J: Quantitative analysis of single-molecule superresolution images. Curr Opin Struct Biol 2014, 28C:112–121.
4. Woodside MT, Block SM: Reconstructing folding energy landscapes by single-molecule force spectroscopy. Annu Rev Biophys 2014, 43:19–39.
5. McGinnis S, Madden TL: BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 2004, 32(Web Server issue):W20–W25.
6. Brünger AT, Adams PD, Clore GM, DeLano WL, Gros P, Grosse-Kunstleve RW, Jiang JS, Kuszewski J, Nilges M, Pannu NS, Read RJ, Rice LM, Simonson T, Warren GL: Crystallography & NMR system: A new software suite for macromolecular structure determination. Acta Crystallogr D Biol Crystallogr 1998, 54(Pt 5):905–921.
7. Dolinski K, Ball CA, Chervitz SA, Dwight SS, Harris MA, Roberts S, Roe T, Cherry JM, Botstein D: Expanding yeast knowledge online. Yeast Chichester Engl 1998, 14:1453–1469.
8. Sherlock G, Hernandez-Boussard T, Kasarskis A, Binkley G, Matese JC, Dwight SS, Kaloper M, Weng S, Jin H, Ball CA, Eisen MB, Spellman PT, Brown PO, Botstein D, Cherry JM: The Stanford Microarray Database. Nucleic Acids Res 2001, 29:152–155.
9. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28:235–242.
10. Berman HM: The Protein Data Bank: a historical perspective. Acta Crystallogr A 2008, 64(Pt 1):88–95.
11. Tateno Y, Imanishi T, Miyazaki S, Fukami-Kobayashi K, Saitou N, Sugawara H, Gojobori T: DNA Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Res 2002, 30:27–30.
12. Hamm GH, Cameron GN: The EMBL data library. Nucleic Acids Res 1986, 14:5–9.
13. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res 2013, 41:D36–D42.
14. Bilofsky HS, Burks C: The GenBank genetic sequence data bank. Nucleic Acids Res 1988, 16(5 Pt A):1861–1863.
15. Tibshirani R: Big data: how to avoid a big mess.
16. Reducing our irreproducibility. Nature 2013, 496:398.
17. Tibshirani R: Immune signatures in follicular lymphoma. N Engl J Med 2005, 352:1496–1497. author reply 1496–1497.
18. Ioannidis JPA: Why most published research findings are false. PLoS Med 2005, 2:e124.
19. Begley CG, Ellis LM: Drug development: Raise standards for preclinical cancer research. Nature 2012, 483:531–533.
20. Prinz F, Schlange T, Asadullah K: Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov 2011, 10:712.
21. Ioannidis JPA: How to make more published research true. PLoS Med 2014, 11:e1001747.
22. Ioannidis JPA, Greenland S, Hlatky MA, Khoury MJ, Macleod MR, Moher D, Schulz KF, Tibshirani R: Increasing value and reducing waste in research design, conduct, and analysis. Lancet 2014, 383:166–175.
23. Liu Y, Park J, Dahmen KA, Chemla YR, Ha T: A comparative study of multivariate and univariate hidden Markov modelings in time-binned single-molecule FRET data analysis. J Phys Chem B 2010, 114:5386–5403.
24. Qin F, Auerbach A, Sachs F: A direct optimization approach to hidden Markov modeling for single channel kinetics. Biophys J 2000, 79:1915–1927.
25. McKinney SA, Joo C, Ha T: Analysis of single-molecule FRET trajectories using hidden Markov modeling. Biophys J 2006, 91:1941–1951.
26. Qin F, Auerbach A, Sachs F: Hidden Markov modeling for single channel kinetics with filtering and correlated noise. Biophys J 2000, 79:1928–1944.
27. Watkins LP, Yang H: Information bounds and optimal analysis of dynamic single molecule measurements. Biophys J 2004, 86:4015–4029.
28. Bronson JE, Fei J, Hofman JM, Gonzalez RL Jr, Wiggins CH: Learning rates and states from biophysical time series: a Bayesian approach to model selection and single-molecule FRET data. Biophys J 2009, 97:3196–3205.
29. Greenfeld M, Pavlichin DS, Mabuchi H, Herschlag D: Single Molecule Analysis Research Tool (SMART): an integrated approach for analyzing single molecule data. PLoS One 2012, 7:e30024.
30. Bronson JE, Hofman JM, Fei J, Gonzalez RL Jr, Wiggins CH: Graphical models for inferring single molecule dynamics. BMC Bioinformatics 2010, 11(8):S2.
31. Van de Meent J-W, Bronson JE, Wiggins CH, Gonzalez RL Jr: Empirical Bayes methods enable advanced population-level analyses of single-molecule FRET experiments. Biophys J 2014, 106:1327–1337.
32. Van de Meent J-W, Bronson JE, Wood F, Gonzalez RL Jr, Wiggins CH: Hierarchically-coupled hidden Markov models for learning kinetic rates from single-molecule data. Proc Int Conf Mach Learn 2013, 28:361–369.
The authors would like to thank any members of the single-molecule community who take the time to adopt the SMD format. In particular we would like to thank Prof. Frederick Sachs for agreeing to make the widely used QuB analysis package compatible with the SMD format and Prof. Taekjip Ha for agreeing to make the widely used HaMMy analysis package compatible with the SMD format. Additionally we would like to thank members of the Herschlag and Gonzalez labs as well as Prof. Aaron Hoskins (University of Wisconsin at Madison) for critical feedback. This work was supported by a NIH National Institute of General Medical Sciences grant (P01 GM066275) to D.H.; a NSF CAREER Award (MCB 0644262) and a NIH National Institute of General Medical Sciences grant (R01 GM084288) to R.L.G.; a NIH National Centers for Biomedical Computing grant (U54CA121852) to C.H.W.; a Rubicon fellowship (680-50-1016) from the Netherlands Organization for Scientific Research (NWO) to J.W.M.; and a NIH training grant in Biotechnology (5T32GM008412) to M.G.
The authors declare that they have no competing interests.
MG, JWM, DSP, HM, CHW, RLG and DH all contributed to the inception of the project. MG, JWM and DSP carried out the design and implementation of the SMD format. MG, JWM, DSP, HM, CHW, RLG and DH all contributed to the writing of the manuscript. MG updated the SMART package to be compatible with SMD and JWM updated ebFRET to be compatible with SMD. All authors read and approved the final manuscript.
Max Greenfeld and Jan-Willem van de Meent contributed equally.