A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines
© Cieślik and Mura; licensee BioMed Central Ltd. 2011
Received: 20 December 2010
Accepted: 25 February 2011
Published: 25 February 2011
Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts.
To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, allowing one to tune the trade-off between parallelism and lazy-evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain specific data-containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats).
PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy also can be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples.
Workflows are a natural model of how researchers process data, and will therefore only gain in relevance and importance as science continues becoming more data- and information-intensive. Unlike business workflows, which emphasize process modeling, automation, and management, and are control-flow oriented [2, 3], scientific pipelines emphasize data-flow, and fundamentally consist of chained transformations of collections of data items. This is particularly true in bioinformatics (see, e.g., and references therein), spurring the recent development of workflow management systems (WMS) to standardize, modularize, and execute in silico protocols. Such systems generally enable the construction, automation, deployment, exchange, re-use, and reproducibility of data-processing/analysis tasks; catalogs of bioinformatically-capable WMS and web service (WS)-related systems can be found in relatively recent reviews [6, 7].
The feature sets of existing WMS solutions vary in terms of monitoring, debugging, workflow validation, provenance capture, data management, and scalability. While some WMS suites (e.g., BioWMS, Ergatis) and pipelining solutions (e.g., Cyrille2) are tailored to the bioinformatics domain, many serve as either general-purpose, domain-independent tools (e.g., Kepler and its underlying Ptolemy II system, Taverna, KNIME), frameworks for creating abstracted workflows suitable for enactment in grid environments (e.g., Pegasus), high-level "enactment portals" that require less programming effort by users (e.g., Biowep), or lower-level software libraries (e.g., the Perl-based Biopipe). Indeed, the recent proliferation of WMS technologies and implementations led Deelman et al. to systematically study a "taxonomy of features", particularly as regards the four stages of a typical workflow's lifecycle: creation, mapping to resources, execution, and provenance capture. The division into task-based versus service-based systems appears to be fundamental. Systems of the first kind emphasize the orchestration and execution of a workflow, while the latter focus on service discovery and integration. With its emphasis on enabling facile creation of Python-based workflows for data processing (rather than, e.g., WS discovery or resource brokerage), PaPy is a task-based tool.
Traditional, non-WMS solutions for designing, editing, and deploying workflows are often idiosyncratic, and require some form of scripting to create input files for either a Make-like software build tool or a compute cluster task scheduler. Such approaches are, in some regards, simpler and more customizable, but they lack the aforementioned benefits of workflow systems; most importantly, manual approaches are brittle and inflexible (not easily sustainable, reconfigurable, or reusable), because the data-processing logic is hardwired into 'one-off' scripts. At the other extreme, a common drawback of integrated WMS suites is that, for transformations outside the standard repertoire of the particular WMS, a user may need to program custom tasks with numerous (and extraneous) adaptor functions ('shims' [17, 18]) to finesse otherwise incompatible data to the WMS-specific data-exchange format. This, then, limits the general capability of a WMS in utilizing ('wrapping') available codes to perform various, custom analyses. PaPy is a Python programming library that balances these two extremes, making it easy to create data-processing pipelines. It provides many of the benefits of a WMS (modular workflow composition, ability to distribute computations, monitoring of execution), but preserves the simplicity of the Make-style approach and the flexibility of a general-purpose programming language. (PaPy-based workflows are written in Python.) The application programming interface (API) of PaPy reflects the underlying flow-based programming paradigm, and therefore avoids any "impedance mismatch" in expressing workflows. This enables PaPy to expose a compact, yet flexible and readily extensible, user interface.
Flow-based programming (FBP) and related approaches, such as dataflow programming languages, define software systems as networks of message-passing components. Discrete data items pass (as 'tokens') between components, as specified by a connection/wiring diagram; the runtime behavior (concurrency, deadlocks, etc.) of such systems may be analyzed via formal techniques such as Petri nets. Most importantly for bioinformatics and related scientific domains, the individual pipeline components are coupled only by virtue of the pattern of data traversal across the graph and, therefore, the functions are highly modular, are insulated from one another, and are re-usable. The connections are defined independently of the processing components. Thus, flow-based programs can be considered as (possibly branched) data-processing assembly lines. Dataflow programming lends itself as a model for pipelining because the goal of modular data-processing protocols maps naturally onto the concept of components and connections. The input stream to a component consists of self-contained (atomic) data items; this, together with loose coupling between processing tasks, allows for relatively easy parallelism and, consequently, feasible processing of large-scale datasets.
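The component/connection picture can be made concrete with ordinary Python generators standing in for message-passing components (a minimal sketch, not PaPy code; all names are illustrative):

```python
def source(items):
    # Component 1: emit discrete data items ('tokens') downstream.
    for item in items:
        yield item

def transform(stream):
    # Component 2: a processing node, coupled to its neighbors only
    # through the data stream (here it squares each token).
    for item in stream:
        yield item * item

def sink(stream):
    # Component 3: collect the processed tokens.
    return list(stream)

# The 'wiring diagram' is simply the nesting of the calls;
# the components themselves remain independent and re-usable.
result = sink(transform(source([1, 2, 3])))
```

Because each component touches only its input stream, any node can be replaced or re-wired without modifying the others, which is the essence of the loose coupling described above.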
In PaPy, workflows are built from ordinary, user-definable Python functions with specific, well-defined signatures (call/return semantics). These functions define the operations of an individual PaPy processing node, and can be written in pure Python or may 'wrap' entirely non-Python binaries/executables. Thus, there are no arbitrary constraints on these functions or on a PaPy pipeline, in terms of functional complexity, utilized libraries, or wrapped third-party programs. In this respect, PaPy is agnostic of specific application domains (astronomy, bioinformatics, cheminformatics, etc.). An auxiliary, independent module ('NuBio') is also included, to provide data-containers and functions for the most common tasks involving biological sequences and structures.
PaPy has been implemented as a standard, cross-platform Python (CPython 2.6) package; the Additional File 1 (§3.1) provides further details on PaPy's platform independence, in terms of software implementation and installation. PaPy's dataflow execution model can be described, in the sense of Johnston et al., as a demand-driven approach to processing of data streams. It uses the multiprocessing package for local parallelism (e.g., multi-core or multi-CPU workstations), and a Python library for remote procedure calls (RPyC) to distribute computations across networked computers. PaPy was written using the dataflow and object-oriented programming paradigms, the primary design goal being to enable the logical construction and deployment of workflows, optionally using existing tools and code-bases. The resulting architecture is based on well-established concepts from functional programming (such as higher-order 'map' functions) and workflow design (such as directed acyclic graphs), and naturally features parallelism, arbitrary topologies, robustness to exceptions, and execution monitoring. The exposed interface allows one to define what the data-processing components do (workflow functionality), how they are connected (workflow structure) and where (upon what compute resources) to execute the workflow. These three aspects of PaPy's functionality are orthogonal, and therefore cleanly separated in the API. This construction promotes code re-use, clean workflow design, and allows deployment in a variety of heterogeneous computational environments.
Overview of the PaPy package
papy: Provides the core objects and methods for workflow construction and deployment, including the Worker, Piper, Dagger, and Plumber classes (see Table 2).
numap: Supplies an extension of Python's 'imap' facility, enabling parallel/distributed execution of tasks, locally or remotely (see Fig. 1B).
nubio: Provides data-structures and methods specific to bioinformatic data (molecular sequences, alignments, phylogenetic trees, 3D structures).
PaPy's core components (classes) and their roles
Worker, Piper: The core components (processing nodes) of a pipeline. User-defined functions (or external programs) are wrapped as Worker instances; a Piper wraps a Worker and, in conjunction with numap, further specifies the mode of evaluation (serial/parallel, local/remote, etc.); these key pipeline elements also provide exception-handling, logging, and produce/spawn/consume functionality.
Dagger: Defines the data-flow pipeline in the form of a DAG; allows one to add, remove, and connect pipers, and to validate topology. Coordinates the starting/stopping of NuMaps.
Plumber: High-level interface to run & monitor a pipeline: provides methods to save/load pipeline code, alter and monitor state (initiate/run/pause/stop/etc.), and save results. (See Additional File 1 §3.2 for more information on the subtle differences between the Plumber and Dagger classes.)
NuMap: Implements a process/thread worker-pool. Allows pipers to evaluate multiple, nested map functions in parallel, using a mixture of threads or processes (locally) and, optionally, remote RPyC servers.
The 'numap' module supplies a parallel execution engine, using a flexible worker-pool  to evaluate multiple map functions in parallel. Used together with papy, these maps comprise some (or all) of the processing nodes of a pipeline. Like a standard Python 'imap', numap applies a function over the elements of a sequence or iterable object, and it does so lazily. Laziness can be adjusted via 'stride' and 'buffer' arguments (see below). Unlike imap, numap supports multiple pairs of functions and iterable tasks. The tasks are not queued, but rather are interwoven and share a pool of worker 'processes' or 'threads', and a memory 'buffer'. Thus, numap's parallel (thread- or process-based, local or remote), buffered, multi-task functionality extends standard Python's built-in 'itertools.imap' and 'multiprocessing.Pool.imap' facilities.
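Modern Python's standard library exhibits the same lazy-map behavior that numap extends; the following sketch (Python 3, plain stdlib, not the numap implementation) shows on-demand, ordered evaluation with a thread worker-pool, where 'chunksize' plays a role loosely analogous to numap's 'stride':

```python
from multiprocessing.pool import ThreadPool

def square(x):
    return x * x

pool = ThreadPool(4)  # a worker-pool of four threads
# imap returns an iterator: results are produced lazily and in
# order, in batches governed by 'chunksize'.
lazy_results = pool.imap(square, range(10), chunksize=2)
first_three = [next(lazy_results) for _ in range(3)]
pool.close()
pool.join()
```

Pulling only the first few results, as above, is what keeps memory consumption bounded for long input streams.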
Because the architecture of PaPy is generalized, it is more of a software library than a single, domain-specific program, and it is therefore able to drive arbitrary workflows (bioinformatic or not). To enable rapid, consistent development, and facile deployment, of bioinformatics workflows, a lightweight package ('NuBio') is provided. NuBio consists of data structures to store and manipulate biological entities such as molecular sequences, alignments, 3D structures, and trees. The data containers are based on a hierarchical, multi-dimensional array concept. Raw data are stored in flat arrays, but the operational (context-dependent) meaning of a data-item is defined at usage, in a manner akin to NumPy's "view casting" approach to subclassing n-dimensional arrays (see Example 2 below). For example, the string object 'ATGGCG' can act as a 'NtSeq' (sequence of six nucleotides) or as a 'CodonSeq' (sequence of two codons) in NuBio. This allows one to customize the behavior of objects traversing the workflow and the storage of metadata at multiple hierarchical levels. Functions to read and write common file formats are also bundled in PaPy (PDB for structural data, FASTA for sequences, Stockholm for sequence alignments, etc.).
Parallel data-processing is an important aspect of workflows that either (i) deal with large datasets, (ii) involve CPU-intensive methods, or (iii) perform iterated, loosely-coupled tasks, such as in "parameter sweeps" or replicated simulations. Examples in computational biology include processing of raw, 'omics'-scale volumes of data (e.g. ), analysis/post-processing of large-scale datasets (e.g. molecular dynamics simulations in ), and computational approaches that themselves generate large volumes of data (e.g. repetitive methods such as replica-exchange MD simulations [29, 30]). PaPy enables parallelism at the processing node and data-item levels. The former (node-level) corresponds to processing independent data items concurrently, and the latter (item level) to running parallel, independent jobs for a single data item.
Data-handling and serialization issues
Pipers must communicate the results computed by their wrapped functions. In PaPy's execution model, synchronization and message passing within a workflow are achieved by means of queues and locked pipes, in the form of serialized Python objects. (Serialization refers to a robust, built-in means of storing a native Python object as a byte-string, thereby achieving object persistence.) Unlike heavyweight WMS suites such as KNIME (see Additional File 1 §4), PaPy does not enforce a specific, rigorous data-exchange scheme or file format. This intentional design decision is based on the type system of the Python programming language, whereby the valid semantics of an object are determined by its dynamic, user-modifiable properties and methods ("duck typing"). Such potentially polymorphic data structures cannot be described by, e.g., an XML schema, but serialization offers a method of losslessly preserving this flexible nature of Python objects. In PaPy, component interoperability is achieved by adhering to duck-typing programming patterns. By default, no intermediate pipeline results are stored. This behavior can be easily changed by explicitly adding Piper nodes for data serialization (e.g. to JSON) and archiving (e.g. to files) anywhere within a workflow.
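The byte-string serialization referred to here is Python's built-in 'pickle' mechanism; a minimal sketch (the dictionary payload is illustrative):

```python
import pickle

# Any picklable Python object (including user-defined, duck-typed
# classes) can flow between processes; a plain dict is shown here.
item = {'query': 'ATGGCG', 'score': 42.0}

# Serialize the live object to a byte-string (object persistence)...
payload = pickle.dumps(item)

# ...which can cross a queue, pipe, or socket, and then be revived
# losslessly on the other side:
restored = pickle.loads(payload)
```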
Inter-process communication (IPC)
PaPy's interprocess communication (IPC) methods
socket: Communication, via TCP sockets, between hosts connected within a computer network.
pipe: Communication between processes on a single host.
file: Communication via the filesystem; the file storage location must be accessible by all processes (e.g., over an NFS or Samba share).
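The first two transports can be sketched with the standard library alone (a toy illustration of the mechanisms, not PaPy's internal IPC classes):

```python
import os
import socket

# Socket transport (here a connected local pair stands in for a
# TCP connection between networked hosts):
left, right = socket.socketpair()
left.sendall(b'item-1')
received_via_socket = right.recv(1024)
left.close()
right.close()

# Pipe transport, for processes on a single host:
read_fd, write_fd = os.pipe()
os.write(write_fd, b'item-2')
received_via_pipe = os.read(read_fd, 1024)
os.close(read_fd)
os.close(write_fd)
```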
Interactive, real-time viewing of execution progress is valuable for parallel programs in general (e.g. for purposes of debugging), and it is particularly useful in workflow execution and editing to be able to log component invocations during the workflow lifecycle . The information should be detailed enough to allow troubleshooting of errors and performance issues or auditing, and is a key aspect of the general issue of data provenance (data and metadata recording, management, workflow reproducibility). The process of capturing information about the operation of an application is often called 'logging'. For this purpose, PaPy utilizes the Python standard library's 'logging' facility, and automatically records logging statements emitted at various (user-specifiable) levels of detail or severity - e.g., DEBUG, INFO, WARNING, ERROR can be logged by the papy and numap modules. Python supplies rich exception-handling capabilities, and user-written functions need only raise meaningful exceptions on errors in order to be properly monitored.
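A minimal sketch of the standard 'logging' facility on which this monitoring rests (the logger name and messages are illustrative, not PaPy's actual log records):

```python
import io
import logging

# Route records to an in-memory sink so the output is inspectable:
log_sink = io.StringIO()
handler = logging.StreamHandler(log_sink)
handler.setFormatter(logging.Formatter('%(levelname)s:%(name)s:%(message)s'))

logger = logging.getLogger('pipeline')
logger.addHandler(handler)
logger.setLevel(logging.INFO)      # user-specifiable level of detail

logger.debug('per-item detail')    # below INFO, so suppressed
logger.info('piper started')
logger.error('input could not be parsed')

log_text = log_sink.getvalue()
```

In a real deployment the handler would point at a file or console rather than a StringIO buffer; the level-filtering behavior is the same.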
Sooner or later in the life-cycle of a workflow, an error or exception will occur. This will most likely happen within a Worker-wrapped function, as a result of bogus or unforeseen input data, timeouts, or bugs triggered in external libraries. PaPy is resilient to such errors, insofar as exceptions raised within functions are caught, recorded, and wrapped into 'placeholders' that traverse the workflow downstream without disrupting its execution. The execution log will contain information about the error and the data involved.
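The placeholder pattern can be sketched generically as follows (class and function names are hypothetical, not PaPy's actual exception machinery):

```python
class Placeholder(object):
    """Stands in for a failed item and flows downstream unchanged."""
    def __init__(self, exc, item):
        self.exc = exc    # the recorded exception
        self.item = item  # the offending input data

def guarded(func):
    # Wrap a node function so that errors yield placeholders
    # instead of halting the pipeline.
    def wrapper(item):
        if isinstance(item, Placeholder):
            return item   # propagate without re-processing
        try:
            return func(item)
        except Exception as exc:
            return Placeholder(exc, item)
    return wrapper

@guarded
def reciprocal(x):
    return 1.0 / x

# The zero input fails, but the stream keeps flowing:
results = [reciprocal(x) for x in (2.0, 0.0, 4.0)]
```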
Results & Discussion
While a thorough description of PaPy's usage, from novice to intermediate to advanced levels, lies beyond the scope of this article, the following sections (i) illustrate some of the basic features of PaPy and its accompanying NuBio package (Examples 1, 2, 3), (ii) provide heavily-annotated, generic pipeline templates (see also Additional File 1), (iii) outline a more intricate PaPy workflow (simulation-based loop refinement, the details of which are in the Additional File 1), and (iv) briefly consider issues of computational efficiency.
Example 1: PaPy's Workers and Pipers
The basic functionality of a node (Piper) in a PaPy pipeline is literally defined by the node's Worker object (Table 2 and the 'W' in Figure 1A). Instances of the core Worker class are constructed by wrapping functions (user-created or external), and this can be done in a highly general and flexible manner: A Worker instance can be constructed de novo (as a single, user-defined Python function), from multiple pre-defined functions (as a tuple of functions and positional or keyworded arguments), from another Worker instance, or as a composition of multiple Worker instances. To demonstrate these concepts, consider the following block of code:
from papy import Worker
from math import radians, degrees, pi
def papy_radians(input): return radians(input)
def papy_degrees(input): return degrees(input)
worker_instance1 = Worker(papy_radians)
worker_instance1([90.]) # returns 1.57 (i.e., pi/2)
worker_instance2 = Worker(papy_degrees)
worker_instance2([pi]) # returns 180.0
# Note double parentheses (tuple!) in the following:
worker_instance_f1f2 = Worker((papy_radians, papy_degrees))
worker_instance_f1f2([90.]) # returns 90. (rad/deg invert!)
# Another way, compose from Worker instances:
worker_instance_w1w2 = Worker((worker_instance1, worker_instance2))
# Yields same result as worker_instance_f1f2([90.]):
worker_instance_w1w2([90.])
In summary, Worker objects fulfill several key roles in a pipeline: They (i) standardize the input/output of nodes (pipers); (ii) allow one to re-use and re-combine functions into custom nodes; (iii) provide a pipeline with graceful fault-tolerance, as they catch and wrap exceptions raised within their functions; and (iv) wrap functions in order to enable them to be evaluated on remote hosts.
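The tuple-composition behavior shown above amounts to ordinary left-to-right function composition, which can be sketched without PaPy (the 'compose' helper is a hypothetical stand-in for Worker's call semantics):

```python
from math import degrees, radians

def compose(*funcs):
    # Chain funcs left-to-right, like Worker((f1, f2)).
    def composed(value):
        for func in funcs:
            value = func(value)
        return value
    return composed

rad_then_deg = compose(radians, degrees)
# Converting to radians and back to degrees is the identity:
round_trip = rad_then_deg(90.0)
```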
The following block of Python illustrates the next 'higher' level in PaPy's operation - Encapsulating Worker-wrapped functions into Piper instances. In addition to what is done (Workers), the Piper level wraps NuMap objects to define the mode of execution (serial/parallel, processes/threads, local/remote, ordered/unordered output, etc.); therefore, a Piper can be considered as the minimal logical processing unit in a pipeline (squares in Figure 1, 2A, 3A, 4A).
from papy import Worker, Piper
from numap import NuMap
from math import sqrt
# Square-root worker:
def papy_sqrt(input): return sqrt(input)
sqrt_worker = Worker(papy_sqrt)
my_local_numap = NuMap() # Simple (default) NuMap instance
# Fancier NuMap worker-pool:
# my_local_numap = NuMap(worker_type ="thread",\
# worker_num = 4, stride = 4)
my_piper_instance = Piper(worker = sqrt_worker, parallel = my_local_numap)
# returns [1.0, 1.414..., 1.732...]
# following will not work, as piper hasn't been stopped:
# ...but now the call to disconnect will work:
The middle portion of the above block of code illustrates two examples of NuMap construction, which in turn defines the mode of execution of a PaPy workflow: either a default NuMap, or one that specifies multi-threaded parallel execution using a pool of four workers.
Example 2: Basic sequence objects in NuBio
As outlined in the earlier Bioinformatics workflows section, the NuBio package was written to extend PaPy's feature set by including basic support for handling biomolecular data, in as flexible and generalized a manner as possible. To this end, NuBio represents all biomolecular data as hierarchical, multidimensional entities, and uses standard programming concepts (such as 'slices') to access and manipulate these entities. For instance, in this framework, a single nucleotide is a scalar object (i.e., a character), a DNA sequence or other nucleotide string is a vector of rank-1 objects (nucleotides), a multiple sequence alignment of n sequences is analogous to a rank-3 tensor (an (n-dim) array of (1-dim) strings, each composed of characters), and so on. The following blocks of code tangibly illustrate these concepts (output is denoted by '->'):
from nubio import NtSeq, CodonSeq
from string import upper, lower
# A sequence of eight codons:
my_codons_1 = CodonSeq('GUUAUUAGGGGUAUCAAUAUAGCU')
# ...and the third one in it, using the 'get_child' method:
my_codons_1_3 = my_codons_1.get_child(2)
# ...and its raw (internal) representation as a byte string
# (ASCII char codes):
print my_codons_1_3
-> Codon('b', [65, 71, 71])
# Use the 'tobytes' method to dump as a char string:
print my_codons_1_3.tobytes()
-> AGG
# 'get_item' returns the codon as a Python tuple:
print my_codons_1.get_item(2)
-> ('A', 'G', 'G')
# The string 'UGUGCUAUGA' isn't a multiple of 3 in length
# (rejected as a codon object), but is a valid NT sequence object:
my_nts_1 = NtSeq('UGUGCUAUGA')
# To make its (DNA) complement:
my_nts_1_comp = my_nts_1.complement()
print my_nts_1_comp.tobytes()
# Sample application of a string method, rendering the
# original sequence lowercase (in-place modification):
print my_nts_1.tobytes()
-> ugugcuauga
# Use NuBio's hierarchical representations and data containers
# to perform simple sequence(/string) manipulation;
# grab nucleotides 3-7 (inclusive) from the above NT string:
my_nts_1_3to7 = my_nts_1.get_chunk((slice(2, 7), slice(0, 1)))
print my_nts_1_3to7.tobytes()
-> ugcua
# Get all but the first and last (-1) NTs from the above NT
# sequence:
my_nts_1_NoEnds = my_nts_1.get_chunk((slice(1, -1), slice(0, 1)))
print my_nts_1_NoEnds.tobytes()
-> gugcuaug
# Get codons 2 and 3 (as a flat string) from the codon string:
my_codons_1_2to3 = my_codons_1.get_chunk((slice(1, 3, 1), slice(0, 3, 1)))
print my_codons_1_2to3.tobytes()
-> AUUAGG
# Grab just the 3rd (wobble) position NT from each codon:
my_codons_1_wobble = my_codons_1.get_chunk((slice(0, 10, 1), slice(2, 3, 1)))
print my_codons_1_wobble.tobytes()
-> UUGUCUAU
For general convenience and utility, NuBio's data structures can access built-in dictionaries provided by this package (e.g., the genetic code). In the following example, a sequence of codons is translated:
# Simple: Methionine codon, followed by the opal stop codon:
nt_start_stop = NtSeq("ATGTGA")
# Instantiate a (translate-able) CodonSeq object from this:
codon_start_stop = CodonSeq(nt_start_stop.data)
# ...and translate it:
print(codon_start_stop.translate(strict = True))
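Under the hood, such a translation is a codon-by-codon dictionary lookup; the sketch below uses a two-entry toy table (NuBio bundles the full genetic code):

```python
# Toy genetic-code table (the full table has 64 codons):
GENETIC_CODE = {'ATG': 'M', 'TGA': '*'}

def translate(nt_seq):
    # Translate a nucleotide string whose length is a multiple of 3.
    assert len(nt_seq) % 3 == 0, 'length must be a multiple of 3'
    codons = [nt_seq[i:i + 3] for i in range(0, len(nt_seq), 3)]
    return ''.join(GENETIC_CODE[codon] for codon in codons)

# Methionine codon followed by the opal stop codon:
protein = translate('ATGTGA')
```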
The following block illustrates manipulations with protein sequences:
from nubio import AaSeq, AaAln
# Define two protein sequences. Associate some metadata (pI,
# MW, whatever) with the second one, as key/value pairs:
seq1 = AaSeq('MSTAP')
seq2 = AaSeq('M-TAP', meta={'my_key': 'my_data'})
# Create an 'alignment' object, and print its sequences:
aln = AaAln((seq1, seq2))
for seq in aln: print seq
# Print the last 'seq' ("M-TAP"), sans gapped residues
# (i.e., restrict to just the amino acid ALPHABET):
# Retrieve metadata associated with 'my_key':
aln.meta['my_key']
Example 3: Produce/spawn/consume parallelism
Loosely-coupled data can be parallelized at the data item-level via the produce/consume/spawn idiom (Figure 2). To illustrate how readily this workflow pattern can be implemented in PaPy, the source code includes a generic example in doc/examples/hello_produce_spawn_consume.py. The 'hello_*' files in the doc/examples/ directory provide numerous other samples too, including creation of parallel pipers, local grids as the target execution environment, and a highly generic workflow template.
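Stripped of PaPy's machinery, the idiom reduces to three plain functions; in the sketch below (illustrative names and a thread pool, not PaPy's produce/spawn/consume configuration), 'produce' splits one item into chunks, 'spawn' processes the chunks in parallel, and 'consume' merges the results:

```python
from multiprocessing.pool import ThreadPool

def produce(item):
    # Split one data item into independently processable chunks.
    return [item[i:i + 2] for i in range(0, len(item), 2)]

def spawn(chunks, workers=4):
    # Evaluate the chunks concurrently on a thread worker-pool.
    with ThreadPool(workers) as pool:
        return pool.map(str.upper, chunks)

def consume(chunks):
    # Merge the processed chunks back into a single result item.
    return ''.join(chunks)

result = consume(spawn(produce('atggcgtt')))
```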
Generic pipeline templates
To assist one in getting started with bioinformatic pipelines, PaPy also includes a generic pipeline template (Additional File 1 §1.1; 'doc/workflows/pipeline.py') and a sample workflow that illustrates papy/nubio integration (Additional File 1 §1.2; 'doc/examples/hello_workflow.py'). The prototype pipeline includes commonly encountered workflow features, such as the branch/merge topology. Most importantly, the example code is annotated with descriptive comments, and is written in a highly modular manner (consisting of six discrete stages, as described in Additional File 1). The latter feature contributes to clean workflow design, aiming to decouple those types of tasks which are logically independent of one another (e.g., definitions of worker functions, workflow topology, and compute resources need not be linked).
Advanced example: An intricate PaPy workflow
Achieving speed-ups of workflow execution is non-trivial, as process-based parallelism involves (i) computational overhead from serialization; (ii) data transmission over potentially low-bandwidth/high-latency communication channels; (iii) process synchronization, and the associated waiting periods; and (iv) a potential bottleneck at the sole manager process (Figure 4). PaPy allows one to address these issues. Performance optimization is an activity that is mostly independent of workflow construction, and may include collapsing multiple processing nodes (which preserves locality and increases granularity; Figure 1), employing direct IPC (Figure 4, Table 3), adjusting the speedup/memory trade-off parameters (Figure 3), allowing for unordered flow of data and, finally, balancing the distribution of computational resources among segments of the pipeline. The PaPy documentation further addresses these intricacies, and suggests possible optimization solutions for common usage scenarios.
In addition to full descriptions of the generic PaPy pipeline template and the sample loop-refinement workflow (Additional File 1), further information is available. In particular, the documentation distributed with the source-code provides extensive descriptions of both conceptual and practical aspects of workflow design and execution. Along with overviews and introductory descriptions, this thorough (≈50-page) manual includes (i) complete, step-by-step installation instructions for the Unix/Linux platform; (ii) a Quick Introduction describing PaPy's basic design, object-oriented architecture, and core components (classes), in addition to hands-on illustrations of most concepts via code snippets; (iii) an extensive presentation of parallelism-related concepts, such as maps, iterated maps, NuMap, and so on; (iv) a glossary of PaPy-related terms; and (v) because PaPy is more of a library than a program, a complete description of its application programming interface (API).
Although a thorough analysis of PaPy's relationship to existing workflow-related software solutions lies beyond the scope of this report, Additional File 1 (§4) also includes a comparative overview of PaPy, in terms of its similarities and differences to an example of a higher-level/heavyweight WMS suite (KNIME).
PaPy is a Python-based library for the creation and execution of cross-platform scientific workflows. Augmented with a 'NuMap' parallel execution engine and a 'NuBio' package for generalized biomolecular data structures, PaPy also provides a lightweight tool for data-processing pipelines that are specific to bioinformatics. PaPy's programming interface reflects its underlying dataflow and object-oriented programming paradigms, and it enables parallel execution through modern concepts such as the worker-pool and producer/consumer programming patterns. While PaPy is suitable for pipelines concerned with data-processing and analysis (data reduction), it also could be useful for replicated simulations and other types of workflows which involve computationally-expensive components that generate large volumes of data.
Availability and requirements
Project name: PaPy
Project homepage: http://muralab.org/PaPy
Operating system: GNU/Linux
Programming language: Python
Other requirements: A modern release of Python (≥2.5) is advised; the standard, freely-available Python package RPyC is an optional dependency (for distributed computing).
License: New BSD License
Any restrictions to use by non-academics: None; the software is readily available to anyone wishing to use it.
List of abbreviations
API: application programming interface
DAG: directed acyclic graph
RPyC: remote Python calls
WMS: workflow management system
The University of Virginia and the Jeffress Memorial Trust (J-971) are gratefully acknowledged for funding this work.
- Gil A, Deelman E, Ellisman M, Fahringer T, Fox G, Goble C, Livny M, Moreau L, Myers J: Examining the Challenges of Scientific Workflows. IEEE Computer 2007, 40: 24–32.
- Johnston WM, Hanna JRP, Millar RJ: Advances in dataflow programming languages. ACM Comput Surv 2004, 36: 1–34. 10.1145/1013208.1013209
- Ludäscher B, Altintas I, Berkley C, Higgins D, Jaeger E, Jones M, Lee EA, Tao J, Zhao Y: Scientific workflow management and the Kepler system. Concurrency and Computation: Practice & Experience 2006, 18(10):1039–1065.
- Halling-Brown M, Shepherd AJ: Constructing Computational Pipelines. In Bioinformatics, Methods in Molecular Biology™. Volume 453. Edited by: Keith JM. Totowa, NJ: Humana Press; 2008:451–470.
- Deelman E, Gannon D, Shields M, Taylor I: Workflows and e-Science: An overview of workflow system features and capabilities. Future Gener Comput Syst 2009, 25(5):528–540. 10.1016/j.future.2008.06.012
- Tiwari A, Sekhar AKT: Workflow based framework for life science informatics. Comput Biol Chem 2007, 31(5–6):305–319. 10.1016/j.compbiolchem.2007.08.009
- Romano P: Automation of in-silico data analysis processes through workflow management systems. Brief Bioinform 2008, 9: 57–68. 10.1093/bib/bbm056
- Bartocci E, Corradini F, Merelli E, Scortichini L: BioWMS: a web-based Workflow Management System for bioinformatics. BMC Bioinformatics 2007, 8(Suppl 1):S2. 10.1186/1471-2105-8-S1-S2
- Orvis J, Crabtree J, Galens K, Gussman A, Inman JM, Lee E, Nampally S, Riley D, Sundaram JP, Felix V, Whitty B, Mahurkar A, Wortman J, White O, Angiuoli SV: Ergatis: a web interface and scalable software system for bioinformatics workflows. Bioinformatics 2010, 26(12):1488–1492. 10.1093/bioinformatics/btq167
- Fiers MWEJ, van der Burgt A, Datema E, de Groot JCW, van Ham RCHJ: High-throughput bioinformatics with the Cyrille2 pipeline system. BMC Bioinformatics 2008, 9: 96. 10.1186/1471-2105-9-96
- Eker J, Janneck JW, Lee EA, Liu J, Liu X, Ludvig J, Neuendorffer S, Sachs S, Xiong Y: Taming heterogeneity - the Ptolemy approach. Proceedings of the IEEE 2003, 91: 127–144. 10.1109/JPROC.2002.805829
- Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, Li P: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 2004, 20(17):3045–3054. 10.1093/bioinformatics/bth361
- Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Thiel K, Wiswedel B: KNIME - The Konstanz Information Miner. SIGKDD Explorations 2009, 11. 10.1145/1656274.1656280
- Deelman E, Singh G, Su MH, Blythe J, Gil A, Kesselman C, Mehta G, Vahi K, Berriman GB, Good J, Laity A, Jacob JC, Katz DS: Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Scientific Programming Journal 2005, 13: 219–237.
- Romano P, Bartocci E, Bertolini G, Paoli FD, Marra D, Mauri G, Merelli E, Milanesi L: Biowep: a workflow enactment portal for bioinformatics applications. BMC Bioinformatics 2007, 8(Suppl 1):S19. 10.1186/1471-2105-8-S1-S19
- Hoon S, Ratnapu KK, Chia JM, Kumarasamy B, Juguang X, Clamp M, Stabenau A, Potter S, Clarke L, Stupka E: Biopipe: a flexible framework for protocol-based bioinformatics analysis. Genome Res 2003, 13(8):1904–1915.
- Radetzki U, Leser U, Schulze-Rauschenbach SC, Zimmermann J, Lüssem J, Bode T, Cremers AB: Adapters, shims, and glue-service interoperability for in silico experiments. Bioinformatics 2006, 22(9):1137–1143. 10.1093/bioinformatics/btl054
- Lin C, Lu S, Fei X, Pai D, Hua J: A Task Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows. In SCC '09: Proceedings of the 2009 IEEE International Conference on Services Computing. Washington, DC, USA: IEEE Computer Society; 2009:284–291.
- Morrison JP: Flow-Based Programming: A New Approach to Application Development. CreateSpace; 2010.
- Object-relational impedance mismatch [http://en.wikipedia.org/wiki/Object-relational_impedance_mismatch]
- Van der Aalst W: The application of Petri nets to workflow management. Journal of Circuits Systems and Computers 1998, 8: 21–66. 10.1142/S0218126698000043
- Python multiprocessing interface [http://docs.python.org/library/multiprocessing.html]
- RPyC - Remote Python Calls [http://rpyc.wikidot.com]
- Google Labs' WorkerPool API [http://code.google.com/apis/gears/api_workerpool.html]
- Python decorators [http://wiki.python.org/moin/PythonDecorators]
- NumPy's View casting [http://docs.scipy.org/doc/numpy/user/basics.subclassing.html#view-casting]
- Schatz MC: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 2009, 25(11):1363–1369. 10.1093/bioinformatics/btp236
- Tu T, Rendleman CA, Borhani DW, Dror RO, Gullingsrud J, Jensen MO, Klepeis JL, Maragakis P, Miller P, Stafford KA, Shaw DE: A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories. In SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Piscataway, NJ, USA: IEEE Press; 2008:1–12.
- Earl D, Deem MW: Parallel tempering: Theory, applications, and new perspectives. Physical Chemistry Chemical Physics 2005, 7(23):3910–3916. 10.1039/b509983h
- Luckow A, Jha S, Kim J, Merzky A, Schnor B: Adaptive distributed replica-exchange simulations. Philos Transact A Math Phys Eng Sci 2009, 367(1897):2595–2606. 10.1098/rsta.2009.0051
- Misra J: A Discipline of Multiprogramming: Programming Theory for Distributed Applications. Springer; 2001.
- Jeffay K: The real-time producer/consumer paradigm: A paradigm for the construction of efficient, predictable real-time systems. In SAC '93: Proceedings of the 1993 ACM/SIGAPP symposium on Applied computing. New York, NY, USA: ACM; 1993:796–804.
- Dean J, Ghemawat S: MapReduce: Simplified data processing on large clusters. Communications of the ACM 2008, 51: 107–113. 10.1145/1327452.1327492
- Yao Z, Barrick J, Weinberg Z, Neph S, Breaker R, Tompa M, Ruzzo WL: A computational pipeline for high-throughput discovery of cis-regulatory noncoding RNA in prokaryotes. PLoS Comput Biol 2007, 3(7):e126. 10.1371/journal.pcbi.0030126
- Pierce BC: Types and programming languages. Cambridge, MA, USA: MIT Press; 2002.
- Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Brief Bioinform 2009, 10(2):114–128. 10.1093/bib/bbn051
- Liu P, Wu JJ, Yang CH: Locality-Preserving Dynamic Load Balancing for Data-Parallel Applications on Distributed-Memory Multiprocessors. Journal of Information Science and Engineering 2002.
- Soto CS, Fasnacht M, Zhu J, Forrest L, Honig B: Loop modeling: Sampling, filtering, and scoring. Proteins 2008, 70(3):834–843. 10.1002/prot.21612
- Kannan S, Zacharias M: Application of biasing-potential replica-exchange simulations for loop modeling and refinement of proteins in explicit solvent. Proteins 2010.
- Frishman D, Argos P: Knowledge-based protein secondary structure assignment. Proteins 1995, 23(4):566–579. 10.1002/prot.340230412
- Hinsen K: The molecular modeling toolkit: A new approach to molecular simulations. Journal of Computational Chemistry 2000, 21(2):79–85. 10.1002/(SICI)1096-987X(20000130)21:2<79::AID-JCC1>3.0.CO;2-B
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.