- Open Access
Watchdog – a workflow management system for the distributed analysis of large-scale experimental data
© The Author(s) 2018
- Received: 24 August 2017
- Accepted: 5 March 2018
- Published: 13 March 2018
The development of high-throughput experimental technologies, such as next-generation sequencing, has led to new challenges for handling, analyzing and integrating the resulting large and diverse datasets. Bioinformatical analysis of these data commonly requires a number of mutually dependent steps applied to numerous samples for multiple conditions and replicates. To support these analyses, a number of workflow management systems (WMSs) have been developed to allow automated execution of corresponding analysis workflows. Major advantages of WMSs are the easy reproducibility of results as well as the reusability of workflows or their components.
In this article, we present Watchdog, a WMS for the automated analysis of large-scale experimental data. Main features include straightforward processing of replicate data, support for distributed computer systems, customizable error detection and manual intervention into workflow execution. Watchdog is implemented in Java and thus platform-independent and allows easy sharing of workflows and corresponding program modules. It provides a graphical user interface (GUI) for workflow construction using pre-defined modules as well as a helper script for creating new module definitions. Execution of workflows is possible using either the GUI or a command-line interface and a web-interface is provided for monitoring the execution status and intervening in case of errors. To illustrate its potentials on a real-life example, a comprehensive workflow and modules for the analysis of RNA-seq experiments were implemented and are provided with the software in addition to simple test examples.
Watchdog is a powerful and flexible WMS for the analysis of large-scale high-throughput experiments. We believe it will greatly benefit both users with and without programming skills who want to develop and apply bioinformatical workflows with reasonable overhead. The software, example workflows and a comprehensive documentation are freely available at www.bio.ifi.lmu.de/watchdog.
- Workflow management system
- High-throughput experiments
- Large-scale datasets
- Automated execution
- Distributed analysis
The development of high-throughput experimental methods, in particular next-generation sequencing (NGS), now allows large-scale measurements of thousands of properties of biological systems in parallel. For example, modern sequencing platforms now allow simultaneously quantifying the expression of all human protein-coding genes and non-coding RNAs (RNA-seq), active translation of genes (ribosome profiling), transcription factor binding (ChIP-seq), and many more. Dissemination of these technologies combined with decreasing costs resulted in an explosion in the number of available large-scale datasets. For instance, the ENCODE project, an international collaboration that aims to build a comprehensive list of all functional elements in the human genome, currently provides data obtained in more than 7000 experiments with 39 different experimental methods. While such large and diverse datasets still remain the exception, scientific studies now commonly combine two or more high-throughput techniques for several conditions or in time-courses in multiple replicates (e.g. [5–7]).
Analysis of such multi-omics datasets is quite complex and requires many mutually dependent steps. As a consequence, large parts of the analysis often have to be repeated due to modifications of initial analysis steps. Furthermore, errors, e.g. due to aborted program runs or improperly set parameters at intermediate steps, have consequences for all downstream analyses and thus have to be monitored. Since each analysis consists of a set of smaller tasks (e.g. read quality control, mapping against the genome, counting of reads for gene features), it can usually be represented in a structured way as a workflow. Automated execution of such workflows is made possible by workflow management systems (WMSs), which have a number of advantages.
First, a workflow documents the steps performed during the analysis and ensures reproducibility. Second, the analysis can be executed in an unsupervised and parallelized manner for different conditions and replicates. Third, workflows may be reused for similar studies or shared between scientists. Finally, depending on the specific WMS, users with limited programming skills or experience with the particular analysis tools applied within the workflow may more or less easily apply complicated analyses on their own data. On the downside, the use of a WMS usually requires some initial training and some overhead for the definition of workflows. Moreover, the WMS implementation itself might restrict which analyses can be implemented as workflows in the system. Nevertheless, the advantages of WMSs generally outweigh the disadvantages for larger analyses.
In recent years, several WMSs have been developed that address different target groups or fields of research or differ in the implemented set of features. The most well-known example, Galaxy, was initially developed to enable experimentalists without programming experience to perform genomic data analyses in the web browser. Other commonly used WMSs are KNIME, an open-source data analysis platform which allows programmers to extend its basic functionality by adding new Java programs, and Snakemake, a Python-based WMS. Snakemake allows definition of tasks based on rules and automatically infers dependencies between tasks by matching filenames. A more detailed comparison of these WMSs is given in the Results and discussion section.
In this article, we present Watchdog, a WMS designed to support bioinformaticians in the analysis of large high-throughput datasets with several conditions and replicates. Watchdog offers straightforward processing of replicate data and easy outsourcing of resource-intensive tasks on distributed computer systems. Additionally, Watchdog provides a sophisticated error detection system that can be customized by the user and allows manual intervention. Individual analysis tasks are encapsulated within so-called modules that can be easily shared between developers. Although Watchdog is implemented in Java, there is no restriction on which programs can be included as modules. In principle, Watchdog can be deployed on any operating system.
Furthermore, to reduce the overhead for workflow design, a GUI is provided, which also enables users without programming experience to construct and run workflows using pre-defined modules. As a case study on how Watchdog can be applied, modules for read quality checks, read mapping, gene expression quantification and differential gene expression analysis were implemented and a workflow for analyzing differential gene expression in RNA-seq data was created. Watchdog, including documentation, implemented modules as well as the RNA-seq analysis workflow and smaller test workflows can be obtained at www.bio.ifi.lmu.de/watchdog.
Overview of Watchdog
Modules encapsulate re-usable components that perform individual tasks, e.g. mapping of RNA-seq data, counting reads for gene features or visualizing results of downstream analyses. Each module is declared in an XSD file containing the command to execute and the names and valid ranges of parameters. In addition to the XSD file, a module can contain scripts or compiled binaries required by the module and a test script running on example data. Module developers are completely flexible in the implementation of individual modules. They can use the programming language of their choice, include binaries with their modules or automatically deploy required software using Conda (https://conda.io/), Docker (https://www.docker.com/, an example module using a Docker image for Bowtie 2 is included with Watchdog) or similar tools. Furthermore, Watchdog provides a helper bash script to generate the XSD definition file for new modules and (if required) a skeleton bash script that only needs to be extended by the program call.
Essentially, any program that can be run from the command-line can be used in a module and several program calls can be combined in the same module using e.g. an additional bash script. In principle, a module could even contain a whole pipeline, such as Maker-P, but this would run counter to the purpose of a WMS. Here, it would make more sense to separate the individual steps of the pipeline into different modules and then implement the pipeline as a Watchdog workflow. Finally, Watchdog is not limited to bioinformatics analyses, but can also be used for workflows from other domains.
The advantage of XML is that it is widely used in many contexts. Thus, a large fraction of potential Watchdog users should already be familiar with its syntax and only need to learn the Watchdog XML schema. Furthermore, numerous XML editors are available, including plugins for the widely used integrated development environment (IDE) Eclipse, which allow XML syntax checking and document structure highlighting. Finally, a number of software libraries for programmatically loading or writing XML are also available (e.g. Xerces for Java, C++ and Perl (http://xerces.apache.org/), ElementTree in Python).
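As a minimal illustration of such programmatic access, the following Python sketch loads a workflow-like XML document with ElementTree. The element and attribute names (workflow, task, id, name) are hypothetical placeholders chosen for this example and do not reproduce the actual Watchdog XML schema, which is described in the manual.

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow-like document; NOT the real Watchdog schema.
EXAMPLE_XML = """
<workflow>
  <task id="1" name="gunzip"/>
  <task id="2" name="fastqc"/>
</workflow>
"""


def list_tasks(xml_text):
    """Return (id, name) pairs for every <task> element in document order."""
    root = ET.fromstring(xml_text)
    return [(int(task.get("id")), task.get("name")) for task in root.iter("task")]


tasks = list_tasks(EXAMPLE_XML)
```

The same few lines would work for any XML dialect once the real element names are substituted.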
In addition, Watchdog also provides an intuitive GUI (denoted workflow designer) that can be used to design a workflow, export the corresponding XML file afterwards and run the workflow in the GUI.
The core element of Watchdog that executes the workflow was implemented in Java and therefore is, in principle, platform-independent. Individual modules, however, may depend on the particular platform used. For instance, if a module uses programs only available for particular operating systems (e.g. Linux, macOS, Windows), it can only be used for this particular system.
As a first step, Watchdog validates the XML format of the input workflow and parses the XML file. Based on the XML file, an initial set of dependency-free tasks, i.e. tasks that do not depend on any other tasks, is generated and added to the WMS scheduler to execute them. Subsequently, the scheduler continuously identifies tasks for which dependencies have been resolved, i.e. all preceding tasks the task depends on have been executed successfully, and schedules them for execution. Once a task is completed, Watchdog verifies that the task finished successfully. In this case, the task generator and scheduler are informed since dependencies of other tasks might have become resolved. In case of an error, the user is informed via email (optional) and the task is added to the scheduler again but is blocked for execution until the user releases the block or modifies its parameters. Alternatively, the user may decide to skip the task or mark the error as resolved.
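The scheduling principle described above can be sketched in a few lines of Python. This is a simplified, sequential illustration under stated assumptions and not Watchdog's Java implementation: the function name run_workflow, the dictionary-based workflow representation and the execute callback are all introduced here for the example, and a failed task is simply recorded as blocked instead of waiting for user intervention.

```python
from collections import deque


def run_workflow(dependencies, execute):
    """Run tasks in dependency order.

    dependencies: dict mapping each task to the set of tasks it depends on.
    execute: callable(task) returning True on success and False on failure.
    Returns (finished, blocked): completed tasks and tasks put on hold.
    """
    finished, blocked = [], []
    done = set()
    # Initial set of dependency-free tasks.
    ready = deque(sorted(t for t, deps in dependencies.items() if not deps))
    queued = set(ready)
    while ready:
        task = ready.popleft()
        if execute(task):
            done.add(task)
            finished.append(task)
            # Schedule tasks whose dependencies have all finished successfully.
            for other, deps in sorted(dependencies.items()):
                if other not in queued and deps <= done:
                    ready.append(other)
                    queued.add(other)
        else:
            # A failed task stays blocked; tasks depending on it never start.
            blocked.append(task)
    return finished, blocked


workflow = {
    "gunzip": set(),
    "fastqc": {"gunzip"},
    "map": {"gunzip"},
    "count": {"map"},
}
finished, blocked = run_workflow(workflow, lambda task: task != "map")
```

In this example the hypothetical mapping task fails, so read counting is never scheduled, while the independent quality control task still runs.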
In the following, more details are provided on the principles and possibilities of workflow design in Watchdog and on defining custom modules. The GUI is described in detail in Additional file 1.
Process blocks for creating subtasks
Analysis of high-throughput data often requires performing the same analysis steps in parallel for a number of samples representing different conditions or biological or technical replicates. To support these types of analyses, Watchdog uses so-called process blocks to automatically process tasks that differ only in values of parameters, e.g. short read alignment for all FASTQ files in a directory. For this purpose, process blocks define a set of instances, each of which contains one or more variables. For each instance, one subtask is created and subtask placeholders in the task definition are replaced with the variable values of the instance. For the example in which a task is executed for all FASTQ files in a directory, each instance holds one variable containing the absolute file path of the file. The number of subtasks corresponds to the number of FASTQ files in the directory.
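The expansion of a task into subtasks can be illustrated with a short Python sketch. The placeholder syntax ({file}) and the command name align are assumptions made for this example only; Watchdog uses its own placeholder notation, which is documented in the manual.

```python
def create_subtasks(command_template, instances):
    """Create one command per instance by substituting its variable values."""
    return [command_template.format(**variables) for variables in instances]


# One instance per FASTQ file, each holding a single variable with the file path.
instances = [{"file": "/data/a.fastq"}, {"file": "/data/b.fastq"}]
subtasks = create_subtasks("align --in {file}", instances)
```

The number of generated commands equals the number of instances, mirroring the one-subtask-per-instance rule described above.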
Process tables (Fig. 4c) and process input (Fig. 4d) blocks can generate instances with multiple variables. Instances generated by a process table are based on the content of a tab-separated file. The rows of the table define individual instances and the columns the variables for each instance. In case of process input blocks, variables and instances are derived from return values of preceding tasks the task depends on.
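To make the row and column semantics of process tables concrete, the following Python sketch derives instances from tab-separated text. It assumes that the first row of the table holds the variable names; this header convention and the function name instances_from_table are assumptions of the example, not a description of Watchdog's parser.

```python
import csv
import io


def instances_from_table(tsv_text):
    """Turn each table row into one instance (variable name -> value)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


table = "sample\tcondition\nA\tmock\nB\tinfected\n"
table_instances = instances_from_table(table)
```

Each resulting dictionary corresponds to one subtask with two variables, sample and condition.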
Although explicit dependency definition adds a small manual overhead compared to automatic identification based on in- and output filenames as in Snakemake, it also provides more flexibility as dependencies can be defined that are not obvious from filenames. For instance, analysis of sequencing data usually involves quality control of sequencing reads, e.g. with FastQC, before mapping of reads, and users might want to investigate the results of quality control before proceeding to read mapping. However, output files of quality control are not an input to read mapping and thus this dependency could not be identified automatically. To provide more time to manually validate results of some intermediate steps, Watchdog allows adding checkpoints after individual tasks. After completion of a task with checkpoint, all dependent tasks are put on hold until the checkpoint is released. All checkpoints in a workflow can be deactivated upon workflow execution with the -disableCheckpoint flag of the Watchdog command-line version.
If a subtask B_x of a task B only depends on a particular subtask A_x of A instead of all subtasks of A, the definition of subtask dependencies in the workflow allows executing B_x as soon as A_x has finished successfully (but not necessarily other subtasks of A). This is illustrated in Fig. 8b and can be explained easily for the simplest case when the process block used for task B is a process input block containing the return values of subtasks of A. In that case, a subtask B_x depends only on the subtask A_x of A that returned the instance resulting in the creation of B_x. The use of subtask dependencies is particularly helpful if subtasks of A need different amounts of time to finish or cannot all be executed at the same time due to resource restrictions, such as a limited number of CPUs or a limited amount of memory available. In this case, B_x can be executed as soon as A_x has finished but before all other subtasks of A have finished. An example application would be the conversion of SAM files resulting from read mapping (task A) to BAM files (task B).
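The difference between task-level and subtask-level dependencies can be made explicit with a small Python sketch. The function name and its arguments are hypothetical and serve only to contrast the two modes.

```python
def runnable_subtasks(b_instances, finished_a, all_a, per_subtask=True):
    """Return the instances x for which subtask B_x may start.

    b_instances: instance names of task B (one per subtask).
    finished_a: set of instance names whose subtask A_x has finished.
    all_a: all instance names of task A.
    """
    if per_subtask:
        # Subtask dependency: B_x only waits for its own A_x.
        return [x for x in b_instances if x in finished_a]
    # Task dependency: every subtask of A must have finished first.
    return list(b_instances) if set(all_a) <= set(finished_a) else []


samples = ["s1", "s2", "s3"]
early = runnable_subtasks(samples, {"s2"}, samples)
late = runnable_subtasks(samples, {"s2"}, samples, per_subtask=False)
```

With only the second sample mapped, the SAM-to-BAM conversion for that sample can already start under subtask dependencies, whereas a task-level dependency would keep all conversions waiting.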
Parallel and distributed task execution
Every host that accepts secure shell connections (SSH) can be used as a remote executor (see Fig. 9d). In this case, a passphrase-protected private key for user authentication must be provided. For cluster execution, any grid computing infrastructures that implement the Distributed Resource Management Application API (DRMAA) can be utilized (see Fig. 9e). By default, Watchdog uses the Sun Grid Engine (SGE) but other systems that provide a DRMAA Java binding can also be used. Furthermore, Watchdog provides a plugin system that allows users with programming skills to add new executor types without having to change the original Watchdog code. This plugin system is explained in detail in Additional file 3 and was used to additionally implement an executor for computing clusters or supercomputers running the Slurm Workload Manager (https://slurm.schedmd.com/). The plugin system can also be used to provide support for cloud computing services that do not allow SSH. Support for the Message Passing Interface (MPI) is not explicitly modeled in Watchdog, but MPI can be used by individual modules if it is supported by the selected executor.
Finally, to allow storage of potentially large temporary files on the local hard disk of cluster execution hosts and sharing of these files between tasks, Watchdog also implements a so-called slave mode (see Fig. 9f). In slave mode, the scheduler ensures that tasks or subtasks depending on each other are processed on the same host allowing them to share temporary files on the local file system. For this purpose, a new slave is first started on an execution host, which establishes a network connection to the master (i.e. the host running Watchdog) and then receives tasks from the master for processing.
Error detection and handling
Once the task is finished, the checkers are evaluated in the same order as they were added in the XML workflow. In cases in which both success and error were detected by different checkers, the task will be treated as failed. When an error is detected, the user is informed via email notification (if enabled, otherwise the information is printed to standard output), including the name of the execution host, the executed command, the returned exit code and the detected errors.
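The evaluation rule for checkers can be sketched as follows. The representation of checkers as (kind, description, predicate) tuples and of the task result as a dictionary is hypothetical, and the fallback to the exit code when no checker reports anything is an assumption of this sketch.

```python
def evaluate_task(result, checkers):
    """Evaluate checkers in the order given and return (status, errors).

    checkers: list of (kind, description, predicate) with kind being
    'success' or 'error'; predicate(result) is True if the checker fires.
    """
    success_seen = False
    errors = []
    for kind, description, predicate in checkers:
        if not predicate(result):
            continue
        if kind == "error":
            errors.append(description)
        else:
            success_seen = True
    if errors:
        # An error wins even if another checker reported success.
        return "failed", errors
    if success_seen or result["exit_code"] == 0:
        return "success", []
    return "failed", ["non-zero exit code"]


result = {"exit_code": 0, "stdout": "done", "output_size": 0}
checkers = [
    ("success", "finished message found", lambda r: "done" in r["stdout"]),
    ("error", "empty output file", lambda r: r["output_size"] == 0),
]
status, errors = evaluate_task(result, checkers)
```

Here the task exits normally and prints its completion message, but the error checker for an empty output file still marks it as failed.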
Information on failure or success is also available via the web-interface, which then allows the user to perform several actions: (i) modify the parameter values for the task and restart it, (ii) simply restart the task, (iii) ignore the failure of the task or (iv) manually mark the task as successfully resolved. In case of (iii), (sub)tasks that depend on that task will not be executed, but other (sub)tasks will continue to be scheduled and executed. To continue with the processing of tasks depending on the failed task, option (iv) can be used. In this case, values of return parameters of the failed task can be entered manually via the web-interface.
Option (i) is useful if a task was executed with inappropriate parameter values and avoids having to restart the workflow at this point and potentially repeating tasks that are defined later in the workflow but are not dependent on the failed task. As Watchdog aims to execute all tasks without (unresolved) dependencies as soon as executors and resource limitations allow, these other tasks might already be running or even be finished. Option (ii) is helpful if a (sub)task fails due to some temporary technical problem in the system, a bug in a program used in the corresponding module or missing software. The user can then restart the (sub)task as soon as the technical problem or the bug is resolved or the software has been installed without having to restart the other successfully finished or still running (sub)tasks. Here, the XSD definition of a module cannot be changed during a workflow run as XSD files are loaded at the beginning of workflow execution, but the underlying program itself can be modified as long as the way it is called remains the same. Option (iii) allows finishing an analysis for most samples of a larger set even if individual samples could not be successfully processed, e.g. due to corrupt data. Finally, option (iv) is useful if custom error checkers detect a problem with the results, but the user nevertheless wants to finish the analysis.
Defining custom modules
Watchdog is shipped with 20 predefined modules, but the central idea of the module concept is that every developer can define their own modules, use them in connection with Watchdog or share them with other users. Each module consists of a folder containing the XSD module definition file and optional scripts, binaries and test scripts. It should be noted here that while the complete encapsulation of tasks within modules is advantageous for larger tasks consisting of several steps or including additional checks on in- or output, the required module creation adds some burden if only a quick command is to be executed, such as a file conversion or creation of a simple plot. However, to reduce the resulting overhead for module creation, a helper bash script is available for unix-based systems that interactively leads the developer through the creation of the XSD definition file.
For this purpose, the script asks which parameters and flags to add. In addition, optional return parameters can be specified that are required if the module should be used as process input block. If the command should not be called directly because additional functions (e.g. checks for existence of input and output files and availability of programs) should be executed before or after the invocation of the command, the helper script can generate a skeleton bash script that has to be only edited by the developer to include the program and additional function calls. Please note that modules shipped with Watchdog were created with the helper script, thus XSD files and large fractions of bash scripts were created automatically with relatively little manual overhead. Once the XSD file for a module is created, the module can be used in a workflow. By default, Watchdog assumes that modules are located in a directory named modules/ in the installation directory of Watchdog. However, the user can define additional module folders at the beginning of the workflow.
To let first-time users test Watchdog and explore its capabilities, two longer example workflows are provided with the software, which are documented extensively within the XML file (contained in the examples sub-directory of the Watchdog installation directory after configuring the examples, see manual for details). All example workflows can also be loaded into the GUI in order to get familiar with its usage (see Additional file 1). In order to provide workflows that can be used for practically relevant problems, 20 modules were developed that are shipped together with Watchdog. In addition, several smaller example workflows are provided, each demonstrating one particular feature of Watchdog. They are explained in detail in the manual. The next paragraphs describe the two longer example workflows and the corresponding test dataset.
A small test dataset consisting of RNA-seq reads is included in the Watchdog examples directory. It is a subset of a recently published time-series dataset on HSV-1 lytic infection of a human cell line. For this purpose, reads mapping to chromosome 21 were extracted for both an uninfected sample and a sample obtained after eight hours of infection. Both samples in total contain about 308,000 reads.
Workflow 1 - Basic information extraction
This workflow represents a simple example for testing Watchdog and uses modules encapsulating the programs gzip, grep and join, which are usually installed on unix-based systems by default. Processing of the workflow requires about 50MB of storage and less than one minute on a modern desktop computer. As a first step, gzipped FASTQ files are decompressed. Afterwards, read headers and read sequences are extracted into separate files. To demonstrate the ability of Watchdog to restrict the number of simultaneously running jobs, the sequence extraction tasks are limited to one simultaneous run, while the header extraction tasks are run in parallel (at most 4 simultaneously). Once the extraction tasks are finished, the resulting files from each sample are compressed and merged.
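The effect of restricting the number of simultaneously running jobs can be mimicked with a thread pool in Python. This is only an analogy to the behaviour described above, not Watchdog code; the function run_limited and the peak counter used to observe concurrency are constructs of this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_limited(jobs, max_running):
    """Run callables with at most max_running active at once.

    Returns (results in submission order, observed peak concurrency).
    """
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def tracked(job):
        # Record how many jobs are active while this one runs.
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        try:
            return job()
        finally:
            with lock:
                state["running"] -= 1

    # The pool size is the limit on simultaneously running jobs.
    with ThreadPoolExecutor(max_workers=max_running) as pool:
        results = list(pool.map(tracked, jobs))
    return results, state["peak"]


jobs = [lambda i=i: i * i for i in range(6)]
sequential_results, sequential_peak = run_limited(jobs, 1)
parallel_results, parallel_peak = run_limited(jobs, 4)
```

With a limit of one, the jobs run strictly one after another, as for the sequence extraction tasks; with a limit of four, up to four may overlap, as for the header extraction tasks.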
Workflow 2 - Differential gene expression
This workflow illustrates Watchdog’s potentials for running a more complex and practically relevant analysis. It implements a workflow for differential gene expression analysis of RNA-seq data and uses a number of external software programs for this purpose. Thus, although XSD files for corresponding modules are provided by Watchdog, the underlying software tools have to be installed and paths to binaries added to the environment before running this workflow. The individual modules contain dependency checks for the required software that will trigger an error if some of them are missing.
Software required by modules used in the workflow includes FastQC, ContextMap 2, BWA, samtools, featureCounts, RSeQC, R, DEseq, DEseq2, limma, and edgeR. The workflow can be restricted to just the initial analysis steps using the -start and -stop options of the Watchdog command-line version and individual analysis steps can be in- or excluded using the -include and -exclude options. Thus, parts of this workflow can be tested without having to install all programs. Please also note that the workflow was tested on Linux and may not immediately work on macOS due to differences in pre-installed software. Before executing the workflow a few constants have to be set, which are marked as TODO in the comments of the XML file. Processing of the workflow requires about 300MB of storage and a few minutes on a modern desktop computer.
The first step is again decompression of gzipped FASTQ files. Afterwards, quality assessment is performed for each replicate using FastQC, which generates various quality reports for raw sequencing data. Subsequently, the reads are mapped to chromosome 21 of the human genome using ContextMap 2. After read mapping is completed, the resulting SAM files are converted to BAM files and BAM files are indexed using modules based on samtools. Afterwards, reads are summarized to read counts per gene using featureCounts. As methods for differential gene expression detection may require replicates, pseudo-replicates are generated by running featureCounts twice with different parameters. This was done in order to provide a simple example that can be executed as fast as possible and should not be applied when real data is analyzed. In parallel, quality reports on the read mapping results are generated using RSeQC. Finally, limma, edgeR, DEseq and DEseq2 are applied on the gene count table in order to detect differentially expressed genes. All four programs are run as part of one module, DETest, which also combines result tables of the different methods. Several of the provided modules also generate figures using R.
Comparison with other WMSs
Most WMSs can be grouped into two types based on how much programming skill is required in order to create a workflow. If a well-engineered GUI or web interface is provided, users with basic computer skills should be able to create their own workflows. However, GUIs can also restrict the user as some features may not be accessible. Hence, a second group of WMSs addresses users with more advanced programming skills and knowledge of WMS-specific programming or scripting languages.
The most well-known WMS for bioinformatic analyses is Galaxy. It was initially developed to enable experimentalists without programming experience to perform genomic data analyses in the web browser. Users can upload their own data to a Galaxy server, select and combine available analysis tools from a menu and configure them using web forms. To automatically perform the same workflow on several samples in a larger data set, so-called collections can be used.
In addition to computer resources, Galaxy provides a web-platform for sharing tools, datasets and complete workflows. Moreover, users can set up private Galaxy servers. In order to integrate a new tool, an XML-file has to be created that specifies the input and output parameters. Optionally, test cases and the expected output of a test case can be defined. Once the XML-file has been prepared, Galaxy must be made aware of the new tool and be re-started. If public Galaxy servers should be used, all input data must be uploaded to the public Galaxy servers. This is especially problematic for users with only low-bandwidth internet access who want to analyze large high-throughput datasets but cannot set up their own server.
In summary, Galaxy is a good choice for users with little programming experience who want to analyze data using a comfortable GUI, might not have access to enough computer resources for analysis of large high-throughput data otherwise, appreciate the availability of a lot of predefined tools and workflows and do not mind the manual overhead.
The Konstanz Information Miner, abbreviated as KNIME, is an open-source data analysis platform implemented in Java and based on the IDE Eclipse. It allows programmers to extend its basic functionality by adding so-called nodes. In order to create a new node, at least three interfaces must be implemented in Java: (i) a model class that contains the data structure of the node and provides its functionality, (ii) view classes that visualize the results once the node was executed and (iii) a dialog class used to visualize the parameters of the node and to allow the user to change them.
One disadvantage for node developers is that the design of the dialog is labor-intensive, in particular for nodes that accept a lot of parameters. Another shortcoming of KNIME is that only Java code can be executed using the built-in functionality. Hence, wrapper classes have to be implemented in Java if a node requires external binaries or scripts. Furthermore, KNIME does not support distributed execution in its free version. However, two extensions can be bought that allow either workflow execution on the SGE or on a dedicated server.
Hence, the free version of KNIME is not suitable for the analysis of large high-throughput data. However, KNIME can be used by people without programming skills for the analysis of smaller datasets using predefined nodes, especially if a GUI is required that can be used to interactively inspect and visualize the results of the analysis.
A workflow processed by Snakemake is defined as a set of rules. These rules must be specified in Snakemake’s own language in a text file named Snakefile. Similar to GNU Make, which was developed to resolve complex dependencies between source files, each rule describes how output files can be generated from input files using shell commands, external scripts or native Python code. At the beginning of workflow execution, Snakemake automatically infers the rule execution order and dependencies based on the names of the input and output files for each rule. From version 2.4.8 on, dependencies can also be declared by explicitly referring to the output of rules defined further above. Workflows can be applied automatically to a variable number of samples using wildcards, i.e. filename patterns on present files.
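The principle of inferring dependencies from filenames can be illustrated with a short Python sketch. It is a strongly simplified model written for this comparison and does not reflect Snakemake's actual implementation (it ignores wildcards, for example); the rule names and filenames are invented for the example.

```python
def infer_dependencies(rules):
    """Infer rule dependencies by matching input filenames to output filenames.

    rules: dict mapping rule name -> {"input": [...], "output": [...]}.
    Returns a dict mapping rule name -> set of rules producing its inputs.
    """
    producer = {out: name for name, rule in rules.items() for out in rule["output"]}
    return {
        name: {producer[f] for f in rule["input"] if f in producer}
        for name, rule in rules.items()
    }


rules = {
    "qc": {"input": ["s.fastq"], "output": ["s_qc.html"]},
    "map": {"input": ["s.fastq"], "output": ["s.sam"]},
    "to_bam": {"input": ["s.sam"], "output": ["s.bam"]},
}
inferred = infer_dependencies(rules)
```

The conversion rule is correctly linked to the mapping rule, but no link between quality control and mapping is found because they share no files, which is the kind of dependency that has to be declared explicitly.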
In Snakemake, there is no clear separation between the tool library and workflow definition as the command used to generate output files is defined in the rule definition itself. Starting with version 3.5.5, Snakemake introduced re-usable wrapper scripts e.g. around command-line tools. In addition, it provides the possibility to include either individual rules or complete workflows as sub-workflows. Thus, Snakemake now allows both encapsulation of integrated tools as well as quickly adding commands directly into the workflow.
By default, no new jobs are scheduled in Snakemake as soon as one error is detected based on the exit code of the executed command. Accordingly, the processing of the complete workflow is halted until the user fixes the problem. This is of particular disadvantage if time-consuming tasks are applied on many replicates in parallel and one error for one replicate prevents execution of tasks for other replicates. While this default mode can be overridden by the –keep-going flag, this flag has to be set when starting execution of the workflow and applies globally independent of which particular parts of the workflow caused the error. In addition, the option –restart-times allows automatically restarting jobs after failure for a predefined number of times and each rule can specify how resource constraints are adapted in case of restarts. However, this option is only useful in case of random failure or failure due to insufficient resources. If errors result from incorrect program calls or inappropriate parameter values, restarting the task will only result in the same error again. Finally, Snakemake is the only one of the compared WMSs that does not provide return variables that can be used as parameters in later steps.
In summary, Snakemake is a much improved version of GNU Make. Programmers will be able to create and execute own workflows using Snakemake once they learned the syntax and semantic of the Snakemake workflow definition language. However, as Snakemake does not offer a GUI or editor for workflow design, most experimentalists without programming skills will not be able to create their own workflows.
The idea behind the WMS Nextflow is to use pipes to transfer information from one task to subsequent tasks. In Unix, pipes act as shared data streams between two processes, whereby one process writes data to a stream and another reads that data in the same order as it was written. In Nextflow, different tasks communicate through channels, which are equivalent to pipes, by using them as input and output. A workflow consists of several tasks, which are denoted as processes and are defined using Nextflow’s own language. The commands that are executed by processes can be either bash commands or defined in Nextflow’s own scripting language. Nextflow also provides the possibility to apply a task to a set of input files that follow a specific filename pattern, using a channel that is filled with the filenames at runtime.
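The channel concept can be sketched in plain Python. The example below is not Nextflow code and does not use any Nextflow API. It assumes that a channel can be modelled as a FIFO queue shared between two threads, and all names (`fill_channel`, `process`, `run_pipeline`, the `.fastq` to `.bam` renaming) are hypothetical.

```python
import fnmatch
import queue
import threading
from typing import Callable, List

DONE = object()  # sentinel marking the end of a channel


def fill_channel(channel: queue.Queue, filenames: List[str], pattern: str) -> None:
    """Put all filenames matching the pattern into the channel, in order."""
    for name in filenames:
        if fnmatch.fnmatch(name, pattern):
            channel.put(name)
    channel.put(DONE)


def process(in_channel: queue.Queue, out_channel: queue.Queue,
            task: Callable[[str], str]) -> None:
    """Read items from the input channel and write results to the output channel."""
    while True:
        item = in_channel.get()
        if item is DONE:
            out_channel.put(DONE)
            return
        out_channel.put(task(item))


def drain(channel: queue.Queue) -> List[str]:
    """Collect all items from a channel until the sentinel is seen."""
    results = []
    while True:
        item = channel.get()
        if item is DONE:
            return results
        results.append(item)


def run_pipeline(filenames: List[str], pattern: str) -> List[str]:
    reads: queue.Queue = queue.Queue()   # channel filled at runtime
    mapped: queue.Queue = queue.Queue()  # output channel of the mapping step
    workers = [
        threading.Thread(target=fill_channel, args=(reads, filenames, pattern)),
        threading.Thread(target=process,
                         args=(reads, mapped, lambda f: f.replace(".fastq", ".bam"))),
    ]
    for worker in workers:
        worker.start()
    results = drain(mapped)
    for worker in workers:
        worker.join()
    return results
```

As with a Unix pipe, the consumer sees the items in the order in which the producer wrote them, and the set of inputs is only determined when the pattern is evaluated at runtime.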
By default, all running processes are killed by Nextflow if a single process causes an error. This is particularly inconvenient if tasks with long runtimes are processed (e.g. transcriptome assembly based on RNA-seq reads). However, alternative error strategies can be defined for each task before workflow execution, which allow the user to either wait for the completion of scheduled tasks, ignore execution errors for this process or resubmit the process. In the latter case, computing resources can also be adjusted dynamically.
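Resubmission with dynamically adjusted resources can be illustrated as follows. This is a minimal Python sketch under stated assumptions, not Nextflow's implementation: the helper `run_with_retry`, the linear scaling of memory with the attempt number and the 12 GB requirement of the example task are all hypothetical.

```python
from typing import Callable, List, Tuple


def run_with_retry(task: Callable[[int], bool], base_memory_gb: int,
                   max_retries: int) -> Tuple[bool, List[int]]:
    """Resubmit a task, requesting more memory with every attempt.

    Returns whether the task finally succeeded and the memory
    requested for each attempt.
    """
    requested: List[int] = []
    for attempt in range(1, max_retries + 2):  # first run plus retries
        memory_gb = base_memory_gb * attempt   # assumed scaling rule
        requested.append(memory_gb)
        if task(memory_gb):
            return True, requested
    return False, requested


def assembly(memory_gb: int) -> bool:
    """Hypothetical long-running task that needs at least 12 GB of memory."""
    return memory_gb >= 12
```

With a base request of 4 GB and three allowed retries, the task in this sketch succeeds on the third attempt; with only one retry it fails after requesting 4 GB and 8 GB.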
In Nextflow, there is no encapsulation of integrated tools at all since the commands to execute are defined in the file containing the workflow. While this is advantageous for quickly executing simple tasks, reusing tasks in the same or other workflows requires code duplication. Furthermore, Nextflow also does not offer a GUI for workflow design, which makes it hard for beginners to create their own workflows as they must be written in Nextflow’s own very comprehensive programming language.
In this article, we present the WMS Watchdog, which was developed to support the automated and distributed analysis of large-scale experimental data, in particular next-generation sequencing data. The core features of Watchdog include straightforward processing of replicate data, support for and flexible combination of distributed computing or remote executors and customizable error detection that allows automated identification of technical and content-related failure as well as manual user intervention.
Due to the wide use of XML, most potential users of Watchdog will already be familiar with the syntax used in Watchdog and only need to learn the semantics. This is in contrast to other WMSs that use their own syntax. Furthermore, Watchdog’s powerful GUI also allows non-programmers to construct workflows using predefined modules. Moreover, module developers are completely free to choose which software or programming language they use in their modules. Here, the modular design of the tool library provides an easy way of sharing modules by simply sharing the module folder.
In summary, Watchdog combines advantages of existing WMSs and provides a number of novel useful features for more flexible and convenient execution and control of workflows. Thus, we believe that it will benefit both experienced bioinformaticians as well as experimentalists with no or limited programming skills for the analysis of large-scale experimental data.
Project name: Watchdog
Homepage: www.bio.ifi.lmu.de/watchdog; Bioconda package: anaconda.org/bioconda/watchdog-wms; Docker image: hub.docker.com/r/klugem/watchdog-wms/
Operating system: Platform independent
Programming language: Java, XML, XSD
Other requirements: Java 1.8 or higher, JavaFX for the GUI
License: GNU General Public License (GPL)
Any restrictions to use by non-academics: none
This work was supported by grants FR2938/7-1 and CRC 1123 (Z2) from the Deutsche Forschungsgemeinschaft (DFG) to CCF.
Availability of data and materials
Watchdog is freely available at http://www.bio.ifi.lmu.de/watchdog.
MK implemented the software and wrote the manuscript. CCF helped in revising the manuscript and supervised the project. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.