
Table 1 Summary of assumptions (detailed in section "Workflow enactment using the selected systems") and corresponding recommendations for reproducibility

From: Investigating reproducibility and tracking provenance – A genomic workflow case study

Assumption: Sufficient storage and compute resources are available to process large volumes of genomic data.
Recommendation: Workflow developers should provide complete documentation of the compute and storage requirements along with the workflow to achieve long-term reproducibility of scientific results.

Assumption: High-performance networking infrastructure is available for moving bulk genomic data.
Recommendation: Given the size and volume of genomic data, researchers reproducing an analysis should ensure that appropriate networking infrastructure for data transfer is at hand.

Assumption: The computing platform is preconfigured with the base software required by the workflow specification.
Recommendation: Workflow developers should provide a checkpoint mechanism that verifies the compatibility of the computing platform a researcher deploys to reproduce the original analysis (a minimal environment-check sketch follows the table).

Assumption: Users are responsible for ensuring access to copyrighted or proprietary tools.
Recommendation: The community should encourage work that leverages open-source software and collaborative approaches, thereby avoiding dependence on copyrighted or proprietary tools.

Assumption: An analysis environment with a particular directory structure and file naming conventions is set up before the workflow is executed.
Recommendation: Workflow developers should avoid hardcoding environmental parameters such as file names, absolute file paths, and directory names, which would otherwise tie the workflow to a specific environment setup and configuration (see the path-handling sketch after the table).

Assumption: Appropriate datasets are used as input to the tools incorporated in the workflow.
Recommendation: Because bioinformatics tools require strict adherence to input and reference file formats, data annotations and controlled access to the primary data help reproduce the workflow precisely (see the format-check sketch after the table).

Assumption: Users have a comprehensive understanding of the analysis, and the information provided (in the form of an incomplete workflow diagram) is sufficient to convey a high-level view of the workflow.
Recommendation: Workflow developers should provide a complete data-flow diagram serving as a blueprint that captures all artefacts, including tools, input data, intermediate data products, supporting resources, and processes, as well as the connections between them.

Assumption: Specific tool versions are available and the relevant parameter space is configured.
Recommendation: Tools should either be packaged with the workflow or made available via public repositories so that the exact versions and parameter settings used in the original analysis remain accessible, thereby supporting flexible and customizable workflows (the environment-check sketch below also reports tool versions).

Assumption: Users have proficient knowledge of the specific reference implementation.
Recommendation: This factor may be beyond the workflow developers' control, but detailed documentation of the underlying framework, together with community support, can help overcome the associated learning curve.
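The checkpoint and tool-version recommendations above can be made concrete with a pre-flight script that runs before workflow enactment. The following is a minimal sketch; the tool list (bwa, samtools, bcftools) and version commands are hypothetical assumptions, and a real workflow would ship its own manifest of pinned tools and versions.

```python
#!/usr/bin/env python3
"""Pre-flight environment checkpoint: verify that required tools are on
PATH and report their version banners before launching the workflow."""

import shutil
import subprocess
import sys

# Hypothetical base software for a genomic workflow; replace with the
# tools and versions actually pinned by the workflow being reproduced.
REQUIRED_TOOLS = {
    "samtools": ["samtools", "--version"],
    "bcftools": ["bcftools", "--version"],
    "bwa": ["bwa"],  # bwa prints its version banner on stderr
}


def check_tool(name: str, version_cmd: list) -> bool:
    """Return True if the tool is on PATH; print its version banner."""
    if shutil.which(name) is None:
        print(f"MISSING: {name} not found on PATH")
        return False
    result = subprocess.run(version_cmd, capture_output=True, text=True)
    banner = (result.stdout or result.stderr).strip().splitlines()
    print(f"FOUND:   {name} ({banner[0] if banner else 'version unknown'})")
    return True


if __name__ == "__main__":
    results = [check_tool(name, cmd) for name, cmd in REQUIRED_TOOLS.items()]
    # Refuse to start the analysis on an incompatible platform.
    sys.exit(0 if all(results) else 1)
```

Running such a script turns the implicit platform assumption into an explicit, testable checkpoint.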
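For the hardcoded-parameter recommendation, one common pattern is to resolve locations from environment variables or configuration with overridable defaults, rather than baking absolute paths into the workflow. The variable names below (WORKFLOW_DATA_DIR, WORKFLOW_RESULTS_DIR, REFERENCE_GENOME) are illustrative assumptions, not part of the original study.

```python
#!/usr/bin/env python3
"""Resolve environment-specific settings from configuration instead of
hardcoding them, so the workflow stays portable across setups."""

import os
from pathlib import Path

# Illustrative environment variables with relative-path fallbacks;
# no absolute path is fixed inside the workflow itself.
DATA_DIR = Path(os.environ.get("WORKFLOW_DATA_DIR", "./data"))
RESULTS_DIR = Path(os.environ.get("WORKFLOW_RESULTS_DIR", "./results"))
REFERENCE = Path(os.environ.get("REFERENCE_GENOME", DATA_DIR / "reference.fa"))


def main() -> None:
    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    print(f"reading inputs from {DATA_DIR.resolve()}")
    print(f"using reference     {REFERENCE.resolve()}")
    print(f"writing results to  {RESULTS_DIR.resolve()}")


if __name__ == "__main__":
    main()
```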
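For the input-format recommendation, even a cheap structural check can catch mismatched inputs before an expensive run. The sketch below tests only for a FASTA header line and is purely illustrative; a real pipeline would validate each expected input and reference format in this spirit.

```python
#!/usr/bin/env python3
"""Cheap input-format checkpoint: confirm a reference file looks like
FASTA before handing it to downstream analysis tools."""

import sys
from pathlib import Path


def looks_like_fasta(path: Path) -> bool:
    """A FASTA file must begin with a '>' header line."""
    try:
        with path.open() as fh:
            first = fh.readline()
    except OSError as exc:
        print(f"cannot read {path}: {exc}")
        return False
    return first.startswith(">")


if __name__ == "__main__":
    ref = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("reference.fa")
    if looks_like_fasta(ref):
        print(f"{ref}: FASTA header found, proceeding")
    else:
        sys.exit(f"{ref}: does not look like FASTA; aborting before analysis")
```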