The expectation that science should be reproducible is considered fundamental, but it is often not tested. Every new discovery in science builds on existing knowledge; that is, the published literature acts as a building block for new findings. Using this published literature as a base, the next level of understanding is developed, and so the cycle continues. Therefore, if we cannot reproduce existing knowledge from the literature, we waste considerable effort, resources and time doing potentially wrong science [53], resulting in a "reproducibility crisis" [54]. If a researcher claims a novel finding, someone else interested in the study should be able to reproduce it. Reports are accumulating that most scientific claims are not reproducible, questioning the reliability of science and rendering the literature itself questionable [55, 56]. The true reproducibility of experiments across different systems has not been investigated rigorously or systematically. For computational work such as that described in this paper, reproducibility requires an in-depth understanding not only of the science but also of the data, methods, tools and computational infrastructure, making it a non-trivial task. The challenges imposed by large-scale genomics data demand complex computational workflow environments. A key challenge is how to improve the reproducibility of experiments involving complex software environments and large datasets. Although this question is pertinent to the scientific community as a whole [57], here we focus on genomic workflows.
Reproducing an experiment often requires replicating the precise software environment, including the operating system, base software dependencies and configuration settings, under which the original analysis was conducted. In addition, detailed provenance information about the software versions and parameter settings used aids the reusability of any workflow. Provenance tracking and reproducibility go hand in hand, as provenance traces help make a research process auditable and its results verifiable [58]. Variant calling workflows (our case study) produce genetic variation data that, when translated into a clinical setting, enhance our understanding of disease and ultimately improve healthcare. Given this critical application of the data generated, the entire process leading to such biological insights must be documented systematically to guarantee the reproducibility of the research. However, a generalised set of rules and recommendations for achieving this remains an open challenge, because workflow implementation, storage, sharing and reuse vary significantly with the approach and platform chosen by the researcher. A phenomenon common to every approach, however, is 'workflow decay' [59], caused by factors such as the evolution of the technical environment used to implement a workflow, changes in the state of external resources such as databases, and the unavailability of third-party web resources. Our study contributes to understanding the reproducibility requirements of genomic workflows by investigating a set of assumptions evident from the practical implementation of the case study and by providing standardised recommendations for computational genomic workflow studies.
Owing to the production of exceptional amounts of genomics data, a typical human exome sequence analysis (such as the current case study) requires around a terabyte of storage and up to 64 GB of RAM. As the computational dependencies of workflows have grown from simple batch execution to distributed and parallel processing, researchers should document the storage and compute power a workflow needs to run successfully. Long-term reproducibility of scientific results is hard to achieve if the resources required to reproduce the workflow are not fully declared. Beyond declaring compute and storage requirements, comprehensive efforts by workflow developers could result in better management of dependencies. A tool or workflow built on a specific computing platform needs the exact version of the underlying base software to execute successfully; one example is the requirement for a particular version of Java (1.8) to run the GATK and Picard tools used in our workflow. The absence of such information about base software requirements such as Java or Python will result in at least one unsuccessful execution of the workflow. We recommend that workflow developers devise a mechanism (e.g. provide a script) that implements checkpoints to analyse the suitability of the computing platform before an execution is attempted. This will guide researchers trying to reproduce a workflow, who would otherwise waste considerable time tackling 'dependency hell'. The burden obviously shifts to the workflow developers, but in the longer run it is helpful to declare and document even the most basic information, which is often considered too obvious to state.
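As an illustration, such a pre-flight checkpoint might look like the following minimal Python sketch. It is not part of the original case study; the Java 1.8 and one-terabyte figures are the requirements discussed above, and the function names are ours.

```python
#!/usr/bin/env python3
"""Pre-flight checkpoint: verify platform suitability before a workflow run."""
import re
import shutil
import subprocess
import sys

REQUIRED_JAVA = "1.8"    # e.g. needed by the GATK/Picard tools in our case study
REQUIRED_DISK_GB = 1000  # roughly the terabyte of storage noted above

def java_version():
    """Return the major.minor Java version, or '' if Java is absent."""
    try:
        # 'java -version' reports to stderr, e.g.: java version "1.8.0_292"
        out = subprocess.run(["java", "-version"],
                             capture_output=True, text=True).stderr
    except FileNotFoundError:
        return ""
    match = re.search(r'version "(\d+\.\d+)', out)
    return match.group(1) if match else ""

failures = []
found = java_version()
if found != REQUIRED_JAVA:
    failures.append(f"Java {REQUIRED_JAVA} required, found '{found or 'none'}'")
free_gb = shutil.disk_usage(".").free / 1e9
if free_gb < REQUIRED_DISK_GB:
    failures.append(f"~{REQUIRED_DISK_GB} GB free disk required, found {free_gb:.0f} GB")

if failures:
    sys.exit("Platform check failed:\n  " + "\n  ".join(failures))
print("Platform check passed; safe to attempt workflow execution.")
```

Shipping such a script with the workflow converts the "too obvious to state" assumptions into an explicit, executable declaration.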
Genomic data analysis has grown complex with the increased use of customised scripts and online resources to carry out difficult tasks, increasing both the technical knowledge required and the chance that something will break. One major cause of non-reproducible workflows is the use of volatile third-party resources such as databases, tools or websites [59]. Many workflows cannot be run because the third-party resources they rely on are no longer available, or their results can only be reproduced with a specific version of the software, rendering the workflows unusable. These factors are largely out of the researchers' control: each time an analysis is repeated, it may be assumed that the system it runs on comes preconfigured with all the workflow dependencies. Moreover, downloading large genomic datasets from third-party online resources demands that users have high-performance networking infrastructure available. Volatile third-party resources are an open-ended problem for which several mitigations have been proposed, such as alternative resources or a local copy of the resource [60]. However, we believe that alternative resources may not produce the same output and are thus themselves a barrier to reproducibility of results [19]. The services hosting third-party resources generally make no commitment to supply them indefinitely. Even the most sophisticated and widely used technologies, such as container-based approaches, require a network connection and online resources at least once to build the required software components.
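A local copy of a third-party resource, as suggested in [60], can be combined with checksum verification so that a silent change in the resource is at least detected. The following Python sketch is illustrative only; the URL, mirror path and checksum are placeholders, not resources from our case study.

```python
import hashlib
import urllib.request

PRIMARY_URL = "https://example.org/reference/hg19.fasta.gz"  # hypothetical third-party host
LOCAL_MIRROR = "/data/mirrors/hg19.fasta.gz"                 # locally maintained copy
EXPECTED_SHA256 = "<checksum published with the workflow>"   # placeholder value

def sha256_of(path):
    """Stream the file so large genomic datasets are not held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fetch_reference(dest="hg19.fasta.gz"):
    try:
        urllib.request.urlretrieve(PRIMARY_URL, dest)  # volatile: may disappear
    except OSError:
        print("Primary resource unavailable; falling back to local mirror")
        dest = LOCAL_MIRROR
    if sha256_of(dest) != EXPECTED_SHA256:
        raise RuntimeError("Checksum mismatch: resource has changed or is corrupt")
    return dest
```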
Third-party services such as copyrighted or proprietary resources should be avoided in research involving genomic datasets, as they can make the original resources or tools inaccessible, overshadow the ramifications of the research and hinder reproducibility. Reproducing research that involves such tools may require purchasing or re-implementing the software, which is often not a realistic expectation. Instead, the community should push towards open source software and collaborative science [61], which make scientific knowledge easier to communicate and access. Efforts such as the Centre for Open Science are working to encourage openness and reproducibility in scholarly research, thereby accelerating scientific progress.
Additionally, explicit requirements for a specific analysis environment, e.g. hard-coded paths and names embedded in source code, should be avoided in the pipeline definition. In our case study, Cpipe required the creation of an analysis environment with a particular directory and file naming convention before the workflow could execute successfully [33]. From our experience, we recommend against this practice, as it adversely affects the portability of the workflow and places the extra burden of recreating the analysis environment and related parameters on the researcher reproducing someone else's work. We recommend avoiding hard-coded file names, absolute file paths, host names, user names and IP addresses. Workflow developers should ensure their workflows are independent of a specific analysis environment so that they are more readily executable.
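A simple way to honour this recommendation is to resolve every location from configuration at run time. The sketch below assumes nothing about Cpipe; the environment variable names and the NA12878 sample identifier are purely illustrative.

```python
import os
from pathlib import Path

# Resolve the analysis environment from configuration rather than hard-coding
# absolute paths, host names or user names in the pipeline definition itself.
DATA_ROOT = Path(os.environ.get("WORKFLOW_DATA_ROOT", Path.cwd() / "data"))
REFERENCE = Path(os.environ.get("WORKFLOW_REFERENCE",
                                str(DATA_ROOT / "reference" / "hg19.fasta")))

def sample_reads(sample_id):
    """Build input paths relative to the configured root, never absolute ones."""
    return DATA_ROOT / "samples" / f"{sample_id}.fastq.gz"

print(f"reference: {REFERENCE}")
print(f"reads:     {sample_reads('NA12878')}")
```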
In principle, workflow management systems such as Galaxy use a publicly shared repository for published tools and workflows. In practice this is a challenge, as there are many ways to set up the analysis in the first place. Galaxy allows users to choose the computing platform: the centralised public Galaxy, Galaxy on the cloud, or a local instance. There are more than 80 publicly shared Galaxy servers, each containing a different toolset. Workflow developers can create a workflow on their local instance and later publish it assuming uniformity of tool repositories across platforms; this produces static and inflexible solutions that are challenging to reproduce. We recommend that workflow developers ensure the availability of the tools used in workflows implemented on local instances of any workflow management system, either by sharing them via the repositories associated with that workflow system or through open source code sharing solutions, e.g. a git repository. Repository maintainers, in turn, should make the process of adding tools to centralised repositories straightforward and easy to follow. This would make analyses more cost-effective, encouraging researchers to reuse the resources provided instead of reinventing the wheel.
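One way to check tool availability in practice is to query an instance programmatically before importing a published workflow. The sketch below uses the third-party BioBlend client for the Galaxy API; the instance URL, API key and tool identifiers are placeholders to be substituted, and the identifier parsing is a coarse heuristic rather than a guaranteed convention.

```python
from bioblend.galaxy import GalaxyInstance  # third-party Galaxy API client

# Tool identifiers the published workflow depends on (illustrative values).
REQUIRED = {"bwa", "samtools_sort", "picard_MarkDuplicates"}

gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

# Tool Shed tools carry ids like .../repos/<owner>/<repo>/<tool>/<version>;
# keep just the short tool name for a coarse availability check.
installed = {t["id"].split("/")[-2] if "/" in t["id"] else t["id"]
             for t in gi.tools.get_tools()}

missing = REQUIRED - installed
if missing:
    print("Workflow cannot run on this instance; missing tools:", sorted(missing))
else:
    print("All required tools are available.")
```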
Input data such as sequencing reads in FASTA files and reference datasets play a major role in enabling the reproducibility of genomic workflows and ultimately achieving repeatable results. Even when the user has a comprehensive understanding of the workflow analysis, the absence of input data annotations hinders the successful execution of the workflow. Analysis tools usually require strict adherence to file formats (e.g. the reference must be a single sequence in FASTA format, or the names and order of the contigs in the reference used must exactly match one of the official canonical reference orderings). This demands providing access to the primary data used in the analysis. A major complication of this idea, however, lies in the security and ethical considerations around genomic data; the community needs to address this by providing secure, controlled access to sensitive genomic datasets. The size of genomic datasets can also be an obstacle to sharing them or attaching them to workflow specifications. Where it is not possible to package or share datasets with the workflow, comprehensive annotations will help researchers decide on appropriate datasets for the workflow, and public repositories and resources can be used to archive, preserve and share genomic datasets.
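Where datasets cannot be shipped with a workflow, a small machine-readable manifest of annotations and checksums can stand in for them. The following sketch illustrates one possible format; the field names and the hg19 example are our own choices, not a community standard.

```python
import hashlib
import json
from datetime import date
from pathlib import Path

def describe_dataset(path, source, genome_build):
    """Record enough detail for others to locate and verify an equivalent input."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:  # stream: genomic files can be very large
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "file": Path(path).name,
        "sha256": digest.hexdigest(),
        "size_bytes": Path(path).stat().st_size,
        "source": source,              # repository URL or accession
        "genome_build": genome_build,  # contig names and order depend on this
        "recorded": date.today().isoformat(),
    }

manifest = [describe_dataset("hg19.fasta", "UCSC hg19 reference", "hg19")]
Path("input_manifest.json").write_text(json.dumps(manifest, indent=2))
```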
With ever-evolving repositories, services, tools and data, a workflow specification alone is rarely sufficient to ensure the reproducibility and reusability of scientific experiments, resulting in workflow decay. One way to avoid workflow decay is to capture complete provenance, including annotations for every process during workflow execution, the parameters used, and links to third-party resources such as data and external software services. This information should be made available with the published workflow. The relevant parameter settings for each tool used in an analysis are likewise essential to the reproducibility of results and should be provided with the workflow. Alternatively, workflow developers should package all associated tools when the workflow is published. Workflows should be treated as first-class data objects [62], and container technologies such as Docker, OpenVZ or LXC should be used to package the environment and configuration together.
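As a minimal illustration of provenance capture at execution time, each step's exact command line and tool version can be recorded alongside the outputs. The run_step helper and the JSON trace format below are our own sketch, not an established provenance standard, and the SAMtools command is one representative step.

```python
import json
import subprocess
from pathlib import Path

TRACE = []  # accumulates one provenance record per executed step

def run_step(name, cmd, version_cmd):
    """Execute one workflow step, recording its command line and tool version."""
    probe = subprocess.run(version_cmd, capture_output=True, text=True)
    TRACE.append({
        "step": name,
        "command": cmd,
        "tool_version": ((probe.stdout or probe.stderr).strip().splitlines()
                         or ["unknown"])[0],
    })
    subprocess.run(cmd, check=True)

run_step("sort alignments",
         ["samtools", "sort", "-o", "sample.sorted.bam", "sample.bam"],
         ["samtools", "--version"])

Path("provenance_trace.json").write_text(json.dumps(TRACE, indent=2))
```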
Approaches such as CWL utilise Docker containers, work on the principle of comprehensive declaration, and make minimal internal assumptions about the precise software environment, base software dependencies, configuration settings, parameter alterations and software versions. Such approaches aim to build flexible and customisable workflows that capture the intricate details of every process in a workflow, such as requirement declarations for the runtime environment, data and metadata, input and output parameters, and command-line executables. The result is an archive of the entire software environment that can be re-established to support reproducibility. However, working with this kind of approach is not an easy task: it requires considerable time, effort and substantial technical support (in our case study, provided by the CWL team) first to learn the principles of the language and then to encode the system configuration of a complex genome analysis workflow.
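Once this effort has been made, the declaration pays off because the workflow can be checked and executed with a single reference implementation. Assuming cwltool is installed and that workflow.cwl and inputs.yml are placeholder names for the workflow description and its input object, a run might look like this:

```python
import subprocess

# Validate the declared requirements first, then execute; cwltool pulls the
# Docker images named in the workflow so the archived environment is rebuilt.
subprocess.run(["cwltool", "--validate", "workflow.cwl"], check=True)
subprocess.run(["cwltool", "workflow.cwl", "inputs.yml"], check=True)
```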
Hence, the details vital to the reproducibility of any computational genomic analysis should be documented completely to ensure that critical provenance information is captured. From the experience gained in this study, we posit that workflow developers should, alongside other mechanisms, document the important pieces of information through a graphical representation of the workflow, as indicated in Fig. 3. The flowchart in the figure can serve as a model for recording a high-level representation of the underlying complex workflow: a blueprint containing all the artefacts, including tools, input data, intermediate data products, supporting resources, processes and the connections between them. To re-enact a workflow, users should be directed to explicitly understand and declare all the requirements captured in such a representation. The proposed representation of the variant calling workflow (Fig. 3) contains all the artefacts needed to support reproducibility requirements and provenance tracking across platforms. The concept of a visual workflow representation is implemented in only a few GUI-based workbenches [27, 30, 63], and such high-level representations often give an inadequate illustration of the analysis, as evident from Fig. 4.
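Such a blueprint need not be drawn by hand; it can be generated and versioned alongside the workflow. The sketch below uses the third-party graphviz Python package (which also requires the Graphviz system binaries) to render the generic variant-calling stages discussed in this paper; the node labels are illustrative, not a transcription of Fig. 3.

```python
from graphviz import Digraph  # pip install graphviz; needs Graphviz binaries

g = Digraph("variant_calling_blueprint", format="svg")
# Data artefacts (folder shapes) and processing steps, Fig. 3-style.
g.node("reads", "FASTQ reads", shape="folder")
g.node("ref", "reference FASTA (+ index, dictionary)", shape="folder")
g.node("align", "BWA alignment")
g.node("sort", "SAMtools sort")
g.node("dedup", "Picard MarkDuplicates")
g.node("call", "GATK variant calling")
g.node("vcf", "variants (VCF)", shape="folder")
for src, dst in [("reads", "align"), ("ref", "align"), ("align", "sort"),
                 ("sort", "dedup"), ("dedup", "call"), ("call", "vcf")]:
    g.edge(src, dst)
g.render("workflow_blueprint")  # writes workflow_blueprint.svg
```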
During this study we observed that the final Galaxy workflow diagram does not record the use of some tools, such as BWA Index, SAMtools View, SAMtools Sort, SAMtools Faidx and Picard CreateSequenceDictionary. The incomplete Galaxy workflow diagram (Fig. 4) is therefore challenging to reproduce on other platforms, as the necessary information about each step is not recorded. Platforms that make assumptions about aspects of a workflow without documenting them in the final workflow diagram thus leave an incomplete picture of the reproducibility requirements.
The workflows used to implement biomedical data analyses have grown complex [64], making such experiments difficult to understand and reproduce. A graphical representation (Fig. 3) allows visualisation of multiple aspects of workflow definition and implementation, including data manipulation and interpretation. Representing complex workflows in simple, human-readable formats can significantly reduce the complexity of such analyses through improved understanding. As studies involving complex analysis tasks encompass human judgement, it is important that the research community works in this direction, helping researchers transfer their knowledge and expertise using rich and easy-to-create representations such as the one proposed here. Further, the proposed human-readable description, alongside machine-readable ones, can help identify bottlenecks in the analysis and ultimately accelerate the reproducibility of data-driven science.