From: Investigating reproducibility and tracking provenance – A genomic workflow case study
| Assumptions | Recommendations |
|---|---|
| Sufficient storage and compute resources are available to process large-scale genomic data. | Workflow developers should document the compute and storage requirements alongside the workflow to support long-term reproducibility of scientific results. |
| High-performance networking infrastructure is available to move bulk genomic data. | Given the size and volume of genomic data, researchers reproducing an analysis should ensure that appropriate networking infrastructure for data transfer is on hand. |
| The computing platform is preconfigured with the base software required by the workflow specification. | Workflow developers should provide a checkpoint mechanism that verifies the compatibility of the computing platform a researcher deploys to reproduce the original analysis. |
| Users are responsible for ensuring access to copyrighted or proprietary tools. | The community should encourage work that leverages open-source software and collaborative approaches, thereby avoiding copyrighted or proprietary tools. |
| An analysis environment with a particular directory structure and file naming conventions is set up before the workflow is executed. | Workflow developers should avoid hardcoding environmental parameters such as file names, absolute file paths and directory names, which would otherwise tie the workflow to a specific environment setup and configuration. |
| Appropriate datasets are used as input to the tools incorporated in the workflow. | Because bioinformatics tools require strict adherence to input and reference file formats, providing data annotations and controlled access to the primary data helps reproduce the workflow precisely. |
| Users have a comprehensive understanding of the analysis, and the information provided (an incomplete workflow diagram) is sufficient to convey a high-level understanding of the workflow. | Workflow developers should provide a complete data flow diagram that serves as a blueprint, containing all artefacts, including tools, input data, intermediate data products, supporting resources and processes, and the connections between these artefacts. |
| The specific tool versions are available and the relevant parameter space is set. | Tools should either be packaged with the workflow or made available via public repositories, ensuring access to exactly the versions and parameter settings used in the original analysis while still supporting flexible and customizable workflows. |
| Users have proficient knowledge of the specific reference implementation. | This factor may be outside the workflow developers' control, but detailed documentation of the underlying framework and community support can help overcome the associated learning curve. |
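The checkpoint recommendation above (verifying platform compatibility and exact tool versions before running an analysis) can be sketched as a small pre-flight script. This is a minimal illustration, not the paper's implementation; the `REQUIRED_TOOLS` manifest and the `check_environment` function are hypothetical names introduced here for the example.

```python
import shutil
import subprocess

# Hypothetical manifest shipped alongside a workflow: each entry maps a
# required tool to a version string expected in its `--version` output.
REQUIRED_TOOLS = {
    "python3": "3",
}

def check_environment(required=REQUIRED_TOOLS):
    """Return a list of problems found; an empty list means the platform
    passes this simple compatibility checkpoint."""
    problems = []
    for tool, version_prefix in required.items():
        # Checkpoint 1: the tool must be discoverable on PATH.
        path = shutil.which(tool)
        if path is None:
            problems.append(f"{tool}: not found on PATH")
            continue
        # Checkpoint 2: the reported version must match the manifest.
        out = subprocess.run([tool, "--version"],
                             capture_output=True, text=True)
        reported = (out.stdout or out.stderr).strip()
        if version_prefix not in reported:
            problems.append(f"{tool}: expected {version_prefix}, got '{reported}'")
    return problems
```

Running such a checkpoint at workflow start-up turns the implicit assumption of a preconfigured platform into an explicit, machine-checked contract.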
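The recommendation against hardcoded file names and absolute paths can likewise be illustrated: derive every file location from a small set of parameters supplied at run time (e.g. via a CLI or config file) rather than baking them into the workflow. The `build_paths` function and the directory layout below are assumptions made for this sketch, not part of the original study.

```python
from pathlib import Path

def build_paths(workdir, sample_id):
    """Derive all workflow file locations from two run-time parameters,
    so the workflow is not tied to one environment's directory layout."""
    workdir = Path(workdir)
    return {
        # Hypothetical layout: input reads, alignment output, QC report.
        "reads":   workdir / "input" / f"{sample_id}.fastq.gz",
        "aligned": workdir / "aligned" / f"{sample_id}.bam",
        "report":  workdir / "qc" / f"{sample_id}_report.html",
    }

paths = build_paths("/tmp/run1", "sampleA")
```

Because the only environment-specific values are the two arguments, a researcher reproducing the analysis on a different machine changes nothing inside the workflow itself.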