The ability to reproduce research results is a cornerstone of the scientific method, and results that cannot be reproduced are generally considered invalid by the scientific community. The result of a laboratory investigation must be independently reproduced by others before it is accepted as valid. Similarly, the validity of a clinical research finding is established by recapitulation in multiple distinct clinical research cohorts.
The exciting advances in data collection biotechnologies (microarrays, sequencing, proteomics, etc.) present the biomedical research community with special challenges related to data interpretation and the reproducibility of research results. The interpretation of massive data sets requires sophisticated computational and statistical analysis methods. The data sets themselves evolve as new technologies are introduced, additional tissue samples become available, clinical follow-up is updated, and so on. A specific analysis result may be a function of dozens of data files, program files, and other details. Without carefully tracking how those files are used to generate a specific result, it can be challenging, if not impossible, to describe how any particular research result can be recapitulated from the study data. This challenge represents a crisis for reproducibility as a pillar of the scientific method. Consequently, some journals are developing policies to encourage computational reproducibility [1], and federal authority to audit the computational reproducibility of research results is being expanded [2]. Most recently, the National Cancer Institute issued guidelines that strongly encourage investigators to maintain high standards of computational reproducibility in their ‘omics’ studies [3, 4].
Recent mishaps in clinical cancer genomics research have also shown that there is an ethical obligation to ensure that computational results are fully reproducible. The first indication of problems in a series of clinical trials was the inability to computationally reproduce the results of the supporting scientific publications from the publicly available data sets [5, 6]. Investigative follow-up of these discrepancies showed that the supporting science was fundamentally flawed [7]. Subsequently, the publications were retracted and the clinical trials closed [8]. Thus, it is ethically imperative that data analysis results be computationally reproducible before they are used to direct therapy in a clinical trial [2, 9].
Several computational tools have been developed to facilitate reproducible computing [10]. For example, Sweave [11] uses R (http://www.r-project.org) to compute statistical results and inserts them into a LaTeX (http://www.latex-project.org/) typesetting program that is subsequently converted into a PDF report. To enable this functionality, Sweave defines a special syntax to switch between R code and LaTeX code within the same file. In this way, the Sweave program file directly documents the top-level R code used to generate the corresponding PDF report. Similarly, the R packages knitr [12] and lazyWeave [13] enable one to computationally insert results determined with R into a Wiki mark-up file and an OpenOffice document file, respectively. These literate programming [14] systems internally document exactly how the results in the report file were produced, thereby significantly enhancing the scientific community’s ability to perform reproducible computing.
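To illustrate, consider a minimal Sweave file (the file name example.Rnw and its contents are purely illustrative). R code chunks are delimited by <<>>= and @, and the \Sexpr{} directive inserts a computed value directly into the surrounding LaTeX text:

  \documentclass{article}
  \begin{document}
  <<bootstrap, echo=TRUE>>=
  x <- rnorm(100)
  m <- round(mean(x), 3)
  @
  The sample mean computed by the embedded R code is \Sexpr{m}.
  \end{document}

Processing this file with R CMD Sweave example.Rnw produces a LaTeX file in which the chunk output and the value of m have been filled in; running pdflatex on the result yields the final PDF report.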
However, literate programming systems fail to address other challenges to reproducible computing. As previously mentioned, one analysis result may be a function of a very large number of input data files (such as microarray image files, sequence alignment files, etc.) and computer program files (such as the primary analysis program, locally developed R routines, specific versions of R packages, etc.). These files must be collected and archived to ensure that the result can be reproduced at a later date. Manually reading through a program to identify and collect all of those files is very tedious and cumbersome. Thus, there is a need to computationally collect and archive those files.
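The following sketch conveys the idea in R; it is not rctrack's implementation, and the file and folder names are hypothetical. R's trace() facility can record the name of every file read by read.table so that those files can later be copied into an archive:

  accessed <- character(0)
  # Record the 'file' argument each time read.table is called
  trace(read.table,
        tracer = quote(accessed <<- c(accessed, file)),
        print = FALSE)
  dat <- read.table("genotypes.txt", header = TRUE)  # hypothetical input file
  untrace(read.table)
  dir.create("result-archive")
  file.copy(accessed, "result-archive")  # archive every input file that was read

An automated tool must generalize this idea to all file-reading and file-writing functions, which is impractical to replicate by hand for each analysis.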
Some tools have been developed to support reproducible computing by automatically archiving files. The CDE software (http://www.pgbovine.net/cde.html) automatically generates an archive folder that replicates the entire directory structure for every file used to execute a specific Linux command, so that the identical command can be executed on another Linux machine without any conflicts [15]. The technical completeness of the CDE archive is impressive, and CDE is very helpful for Linux users who wish to ensure the computational reproducibility of their work.
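For example, a user might wrap an analysis command with cde as follows (the script name is illustrative):

  cde Rscript analysis.R

This executes the command and copies every file it accesses (the R installation, shared libraries, data files, and so on) into a self-contained cde-package/ directory tree; that directory can then be moved to another Linux machine and the same command re-run there with the bundled cde-exec program.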
Nevertheless, CDE has several limitations and drawbacks. Obviously, CDE is not helpful for people who use operating systems other than Linux and Unix. Also, the CDE archive folder is excessively redundant in some settings. For example, the CDE archive includes a copy of many installation files for every software program (R, MatLab, SAS, Java, etc.) used to perform an analysis. When CDE is used to document many analyses for multiple projects by a group of users who all have access to the same programs (such as a department of biostatistics or computational biology), the redundancy of such program files across CDE archives will begin to unnecessarily strain the storage resources of the computing infrastructure. This redundancy is amplified when the same large data files (such as genotype data files for genome-wide association studies) are present in multiple CDE archives. Additionally, collecting a large number of copies of proprietary software files in many CDE archives may inadvertently pose legal problems. Furthermore, although the CDE archive is complete from a technical computing perspective, it does not necessarily contain all the information necessary to exactly recapitulate the results of statistical analysis procedures, such as bootstrapping, permutation, and simulation, that rely on random number generation [16]. The initial seed and the random number generator must be retained for such analyses to be fully reproducible, and this information will not be stored in the CDE archive if the program files did not explicitly set the value of the initial seed. Finally, CDE does not automatically generate any file that helps a reviewer understand the computational relationships among files, such as indicating that specific program files generated specific result files. Understanding such computational relationships is necessary to critically evaluate the appropriateness of the methods and the scientific validity of the results.
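In R, for example, recording the seed and the generator makes a bootstrap exactly reproducible (the seed value below is arbitrary):

  set.seed(2014, kind = "Mersenne-Twister")  # record the seed and the generator
  x <- rnorm(100)
  boot1 <- replicate(1000, mean(sample(x, replace = TRUE)))

  set.seed(2014, kind = "Mersenne-Twister")  # restore the same RNG state
  y <- rnorm(100)
  boot2 <- replicate(1000, mean(sample(y, replace = TRUE)))

  identical(boot1, boot2)  # TRUE: the analysis is exactly recapitulated

If a program never calls set.seed(), R seeds the generator from the current time and process ID, and the state can be recovered only from the session’s .Random.seed object; this is precisely the kind of detail that is easily lost.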
We have developed the package rctrack (Additional file 1, also available from our website listed below) to automatically archive data files, code files, and other details to support reproducible computing with R, a statistical software environment that is widely used for the analysis of biomedical data and is freely available for the Unix/Linux, Windows, and MacOS operating systems. The rctrack package may be used in conjunction with literate programming R packages (Sweave, knitr) and does not require the user to modify previously written R programs. It automatically documents details regarding random number generation and a sketch of the R function call stack at the time that data, code, or result files are accessed or generated. These details are saved in an archive, together with additional files that have been automatically archived according to default or user-specified parameters that control the contents and size of the archive. The files in this archive can be used to develop a custom R package or another set of files to help others recapitulate the analysis results at a later date. Finally, the rctrack package also provides functions to audit one archive and to compare two archives at a later date.
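A typical workflow wraps an existing, unmodified analysis script between a pair of tracking calls. The sketch below uses hypothetical placeholder names (start_rctrack and stop_rctrack are illustrative only; see the package documentation for the actual interface):

  library(rctrack)
  start_rctrack("myproject-archive")  # hypothetical call: begin tracking
  source("analysis.R")                # previously written program, unmodified
  stop_rctrack()                      # hypothetical call: write file copies, RNG
                                      # details, and call-stack sketches to the archive

Because the tracking is attached around the analysis rather than written into it, a previously written program can be archived without modification.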