Our strategy to achieve and document reproducible computing

Background

The scientific and ethical importance of reproducible computing in analysis and interpretation of biomedical research data is now widely recognized. However, achieving and documenting reproducible computing is very challenging in a perpetually evolving research environment in which multiple users perform analyses of multiple data files on multiple platforms.

Materials and methods

Here, we describe our three-component strategy to achieve and document permanent reproducible computing in our research environment. First, we use the Sweave literate programming infrastructure to embed R code and report text in the same file. Sweave performs the specified calculations in R, inserts the results directly into a LaTeX typesetting command file, and finally compiles the LaTeX file into a PDF file. Thus, a Sweave file internally documents the top-level R code that produces the reported results. However, a Sweave report does not retain its reproducibility if the input data files and lower-level R code are modified later. Therefore, as the second component of our strategy, we developed the Igloo system to archive and freeze files for permanent reproducibility. The Igloo system requests that the user document every file that is transferred to a frozen archive. It freezes the files in an archive with a directory structure that annotates the files by research team (leukemia, brain tumor, etc.) and category (code file, type of data file, etc.). The archive directory is visible in our Windows and Linux high-performance computing environments and has permission controls to ensure appropriate access to the files. However, neither Sweave nor Igloo assists with the cumbersome task of identifying the specific input files that should be frozen to ensure permanent reproducibility. As the third component of our strategy, we developed the R package rctrack, which computationally tracks the accession and generation of files by an R analysis program. The rctrack package defines a function that identifies files that need to be frozen in order to ensure permanent reproducibility. Additionally, rctrack provides mechanisms to track and document the usage of other software for some calculations. Finally, the rctrack package defines a function that generates a Sweave appendix with details regarding the input data and code files and their impact on the reproducibility of the report.
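
The problem described above, that a report stops being reproducible once an input file is modified after the analysis was run, can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the rctrack or Igloo implementation; the function names build_manifest and changed_files are hypothetical, and the use of SHA-256 checksums is an assumption, not a detail taken from the abstract. The sketch records a checksum for each input file at analysis time and later reports which of those files are missing or altered.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_paths):
    """Record a checksum for every input file an analysis read (hypothetical helper)."""
    return {str(Path(p)): sha256_of(p) for p in input_paths}


def changed_files(manifest):
    """List recorded files that are now missing or whose contents differ (hypothetical helper)."""
    changed = []
    for path, recorded in manifest.items():
        if not Path(path).is_file() or sha256_of(path) != recorded:
            changed.append(path)
    return sorted(changed)
```

In this sketch, any file returned by changed_files is one whose earlier version would have had to be frozen for the original report to remain reproducible.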

Results

By using and further enhancing these tools, we expect to achieve and document permanent and complete reproducibility of all our analyses in the very near future.

Author information

Corresponding author

Correspondence to Stan Pounds.

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Enyinda, N., Liu, Z., Negatu, A. et al. Our strategy to achieve and document reproducible computing. BMC Bioinformatics 14 (Suppl 17), A19 (2013). https://doi.org/10.1186/1471-2105-14-S17-A19

  • DOI: https://doi.org/10.1186/1471-2105-14-S17-A19
