From: Investigating reproducibility and tracking provenance – A genomic workflow case study
| Assumptions | Recommendations |
|---|---|
| Sufficient storage and compute resources are available to process large-scale genomic data. | Workflow developers should document the compute and storage requirements alongside the workflow to support long-term reproducibility of scientific results. |
| High-performance networking infrastructure is available to move bulk genomic data. | Given the size and volume of genomic data, researchers reproducing an analysis should ensure that appropriate networking infrastructure for data transfer is on hand. |
| The computing platform is preconfigured with the base software required by the workflow specification. | Workflow developers should provide a checkpoint mechanism that verifies the compatibility of the computing platform a researcher deploys to reproduce the original analysis. |
| Users are responsible for ensuring access to copyrighted or proprietary tools. | The community should encourage work that leverages open-source software and collaborative approaches, thereby avoiding copyrighted or proprietary tools. |
| An analysis environment with a particular directory structure and file naming conventions is set up before the workflow is executed. | Workflow developers should avoid hardcoding environmental parameters such as file names, absolute file paths and directory names, which would otherwise tie the workflow to a specific environment setup and configuration. |
| Appropriate datasets are used as input to the tools incorporated in the workflow. | Because bioinformatics tools require strict adherence to input and reference file formats, providing data annotations and controlled access to the primary data helps reproduce the workflow precisely. |
| Users have a comprehensive understanding of the analysis, and the information provided (an incomplete workflow diagram) is sufficient to convey a high-level understanding of the workflow. | Workflow developers should provide a complete data flow diagram that serves as a blueprint, containing all artefacts, including tools, input data, intermediate data products, supporting resources and processes, and the connections between these artefacts. |
| The specific tool versions are available and the relevant parameter space is set. | Tools should either be packaged with the workflow or made available via public repositories, ensuring access to exactly the versions and parameter settings used in the original analysis while still supporting flexible and customizable workflows. |
| Users have proficient knowledge of the specific reference implementation. | This factor may be outside the workflow developers' control, but detailed documentation of the underlying framework and community support can help overcome the associated learning curve. |
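The checkpoint recommendation above (verifying platform compatibility and exact tool versions before running an analysis) can be sketched as a small pre-flight script. This is a minimal illustration, not the paper's implementation; the `REQUIRED_TOOLS` manifest and the `check_environment` function are hypothetical names introduced here for the example.

```python
import shutil
import subprocess

# Hypothetical manifest shipped alongside a workflow: each entry maps a
# required tool to a version string expected in its `--version` output.
REQUIRED_TOOLS = {
    "python3": "3",
}

def check_environment(required=REQUIRED_TOOLS):
    """Return a list of problems found; an empty list means the platform
    passes this simple compatibility checkpoint."""
    problems = []
    for tool, version_prefix in required.items():
        # Checkpoint 1: the tool must be discoverable on PATH.
        path = shutil.which(tool)
        if path is None:
            problems.append(f"{tool}: not found on PATH")
            continue
        # Checkpoint 2: the reported version must match the manifest.
        out = subprocess.run([tool, "--version"],
                             capture_output=True, text=True)
        reported = (out.stdout or out.stderr).strip()
        if version_prefix not in reported:
            problems.append(f"{tool}: expected {version_prefix}, got '{reported}'")
    return problems
```

Running such a checkpoint at workflow start-up turns the implicit assumption of a preconfigured platform into an explicit, machine-checked contract.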
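The recommendation against hardcoded file names and absolute paths can likewise be illustrated: derive every file location from a small set of parameters supplied at run time (e.g. via a CLI or config file) rather than baking them into the workflow. The `build_paths` function and the directory layout below are assumptions made for this sketch, not part of the original study.

```python
from pathlib import Path

def build_paths(workdir, sample_id):
    """Derive all workflow file locations from two run-time parameters,
    so the workflow is not tied to one environment's directory layout."""
    workdir = Path(workdir)
    return {
        # Hypothetical layout: input reads, alignment output, QC report.
        "reads":   workdir / "input" / f"{sample_id}.fastq.gz",
        "aligned": workdir / "aligned" / f"{sample_id}.bam",
        "report":  workdir / "qc" / f"{sample_id}_report.html",
    }

paths = build_paths("/tmp/run1", "sampleA")
```

Because the only environment-specific values are the two arguments, a researcher reproducing the analysis on a different machine changes nothing inside the workflow itself.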