While a thorough description of PaPy's usage, from novice to intermediate to advanced levels, lies beyond the scope of this article, the following sections (i) illustrate some of the basic features of PaPy and its accompanying NuBio package (Examples 1, 2, 3), (ii) provide heavily-annotated, generic pipeline templates (see also Additional File 1), (iii) outline a more intricate PaPy workflow (simulation-based loop refinement, the details of which are in the Additional File 1), and (iv) briefly consider issues of computational efficiency.
Example 1: PaPy's Workers and Pipers
The basic functionality of a node (Piper) in a PaPy pipeline is defined by the node's Worker object (Table 2 and the 'W' in Figure 1A). Instances of the core Worker class are constructed by wrapping functions (user-created or external), and this can be done in a highly general and flexible manner: a Worker instance can be constructed de novo (from a single, user-defined Python function), from multiple pre-defined functions (as a tuple of functions and positional or keyworded arguments), from another Worker instance, or as a composition of multiple Worker instances. To demonstrate these concepts, consider the following block of code:
from papy import Worker
from math import radians, degrees, pi
def papy_radians(input): return radians(input[0])
def papy_degrees(input): return degrees(input[0])
worker_instance1 = Worker(papy_radians)
worker_instance1([90.]) # returns 1.57 (i.e., pi/2)
worker_instance2 = Worker(papy_degrees)
worker_instance2([pi]) # returns 180.0
# Note double parentheses (tuple!) in the following:
worker_instance_f1f2 = Worker((papy_radians, papy_degrees))
worker_instance_f1f2([90.]) # returns 90. (rad/deg invert!)
# Another way, compose from Worker instances:
worker_instance_w1w2 = Worker((worker_instance1, \
                               worker_instance2))
# Yields same result as worker_instance_f1f2([90.]):
worker_instance_w1w2([90.])
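Although not shown above, the tuple-of-functions form of construction can also carry additional arguments for the wrapped functions. The following sketch illustrates this under the assumption (to be verified against the PaPy documentation) that Worker accepts a tuple of positional arguments as its second parameter; the papy_power function is invented for illustration:
from papy import Worker
# Hypothetical worker function taking an extra argument:
def papy_power(input, exponent): return input[0] ** exponent
# Assumed construction convention: Worker(function, arguments)
square_worker = Worker(papy_power, (2,))
square_worker([3.]) # expected: 9.0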
In summary, Worker objects fulfill several key roles in a pipeline: They (i) standardize the input/output of nodes (pipers); (ii) allow one to re-use and re-combine functions into custom nodes; (iii) provide a pipeline with graceful fault-tolerance, as they catch and wrap exceptions raised within their functions; and (iv) wrap functions in order to enable them to be evaluated on remote hosts.
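Role (iii), graceful fault-tolerance, can be made concrete with a small sketch. Here, a worker function raises an exception for certain inputs; because Workers catch and wrap exceptions raised within their functions, the expectation is that calling the Worker on such input returns a wrapped error object (whose exact class depends on the PaPy version) rather than crashing the caller. The papy_inverse function is invented for illustration:
from papy import Worker
def papy_inverse(input):
    # raises ZeroDivisionError when input[0] == 0
    return 1. / input[0]
inverse_worker = Worker(papy_inverse)
inverse_worker([4.]) # returns 0.25
# No exception propagates for a zero input; instead, the
# Worker is expected to return a wrapped error object that
# downstream pipers can recognize and skip:
print inverse_worker([0.])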
The following block of Python illustrates the next 'higher' level in PaPy's operation: encapsulating Worker-wrapped functions into Piper instances. In addition to defining what is done (via its Worker), a Piper wraps NuMap objects that define the mode of execution (serial/parallel, processes/threads, local/remote, ordered/unordered output, etc.); a Piper can therefore be considered the minimal logical processing unit in a pipeline (squares in Figures 1, 2A, 3A, 4A).
from papy import Worker, Piper
from numap import NuMap
from math import sqrt
# Square-root worker:
def papy_sqrt(input): return sqrt(input[0])
sqrt_worker = Worker(papy_sqrt)
my_local_numap = NuMap() # Simple (default) NuMap instance
# Fancier NuMap worker-pool:
# my_local_numap = NuMap(worker_type="thread", \
#                        worker_num=4, stride=4)
my_piper_instance = Piper(worker = sqrt_worker, \
                          parallel = my_local_numap)
my_piper_instance([1, 2, 3]).start()
list(my_piper_instance)
# returns [1.0, 1.414..., 1.732...]
# The following will not work, as the piper hasn't been
# stopped yet:
my_piper_instance.disconnect()
# ...but now the call to disconnect will work:
my_piper_instance.stop()
my_piper_instance.disconnect()
The middle portion of the above block of code illustrates two examples of NuMap construction, which, in turn, defines the mode of execution of a PaPy workflow: either a default NuMap, or one (commented out above) that specifies multi-threaded parallel execution using a pool of four workers.
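Because NuMap is designed as a buffered, parallel replacement for itertools.imap, it can also be exercised on its own, outside of any Piper. The following sketch assumes the imap-style call convention (function plus iterable at construction) and an explicit start() method, both per the NuMap documentation; readers should confirm these details against their installed version:
from numap import NuMap
from math import sqrt
# Evaluate sqrt lazily over an iterable, with two workers:
parallel_sqrt = NuMap(sqrt, [1., 4., 9.], worker_num = 2)
parallel_sqrt.start()
print list(parallel_sqrt)
# expected: [1.0, 2.0, 3.0] (ordered output by default)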
Example 2: Basic sequence objects in NuBio
As outlined in the earlier Bioinformatics workflows section, the NuBio package was written to extend PaPy's feature set by including basic support for handling biomolecular data, in as flexible and generalized a manner as possible. To this end, NuBio represents all biomolecular data as hierarchical, multidimensional entities, and uses standard programming concepts (such as 'slices') to access and manipulate these entities. For instance, in this framework, a single nucleotide is a rank-1 object comprised of scalar entities (characters), a DNA sequence or other nucleotide string is a vector of such rank-1 objects (nucleotides), a multiple sequence alignment of n sequences is analogous to a rank-3 tensor (an array of n strings, each of which is itself composed of characters), and so on. The following blocks of code tangibly illustrate these concepts (output is denoted by '->'):
from nubio import NtSeq, CodonSeq
from string import upper, lower
# A sequence of eight codons:
my_codons_1 = CodonSeq('GUUAUUAGGGGUAUCAAUAUAGCU')
# ...and the third one in it, using the 'get_child' method:
my_codons_1_3 = my_codons_1.get_child(2)
# ...and its raw (internal) representation as a byte string
# (ASCII char codes):
print my_codons_1_3
-> Codon('b', [65, 71, 71])
# Use the 'tobytes' method to dump as a char string:
print my_codons_1_3.tobytes()
-> AGG
# 'get_item' returns the codon as a Python tuple:
print my_codons_1.get_item(2)
-> ('A', 'G', 'G')
# The string 'UGUGCUAUGA' isn't a multiple of 3 (rejected
# as codon object), but is a valid NT sequence object:
my_nts_1 = NtSeq('UGUGCUAUGA')
# To make its (DNA) complement:
my_nts_1_comp = my_nts_1.complement()
print my_nts_1_comp.tobytes()
-> ACACGATACT
# Sample application of a string method, rendering the
# original sequence lowercase (in-place modification):
my_nts_1.str(method="lower")
print my_nts_1.tobytes()
-> ugugcuauga
# Use NuBio's hierarchical representations and data
# containers to perform simple sequence(/string) manipulation:
# grab nucleotides 3-7 (inclusive) from the above NT string:
my_nts_1_3to7 = my_nts_1.get_chunk((slice(2, 7), slice(0,1)))
print my_nts_1_3to7.tobytes()
-> ugcua
# Get all but the first and last (-1) NTs from the above NT
# string:
my_nts_1_NoEnds = my_nts_1.get_chunk((slice(1, -1), \
                                      slice(0,1)))
print my_nts_1_NoEnds.tobytes()
-> gugcuaug
# Get codons 2 and 3 (as a flat string) from the codon string:
my_codons_1_2to3 = my_codons_1.get_chunk((slice(1,3,1), \
slice(0,3,1)))
print my_codons_1_2to3.tobytes()
-> AUUAGG
# Grab just the 3rd (wobble) position NT from each codon:
my_codons_1_wobble = my_codons_1.get_chunk((slice(0,10,1), \
                                            slice(2,10,1)))
print my_codons_1_wobble.tobytes()
-> UUGUCUAU
For general convenience and utility, NuBio's data structures can access built-in dictionaries provided by this package (e.g., the genetic code). In the following example, a sequence of codons is translated:
# Simple: Methionine codon, followed by the opal stop codon:
nt_start_stop = NtSeq("ATGTGA")
# Instantiate a (translate-able) CodonSeq object from this:
codon_start_stop = CodonSeq(nt_start_stop.data)
# ...and translate it:
print(codon_start_stop.translate())
-> AaSeq(M*)
print(codon_start_stop.translate(strict = True))
-> AaSeq(M)
The following block illustrates manipulations with protein sequences:
from nubio import AaSeq, AaAln
# Define two protein sequences. Associate some metadata (pI,
# MW, whatever) with the second one, as key/value pairs:
seq1 = AaSeq('MSTAP')
seq2 = AaSeq('M-TAP', meta={'my_key': 'my_data'})
# Create an 'alignment' object, and print its sequences:
aln = AaAln((seq1, seq2))
for seq in aln: print seq
-> AaSeq(MSTAP)
-> AaSeq(M-TAP)
# Print the last 'seq' ("M-TAP"), sans gapped residues
# (i.e., restrict to just the amino acid ALPHABET):
print seq.keep(seq.meta['ALPHABET'])
-> AaSeq(MTAP)
# Retrieve metadata associated with 'my_key':
aln[1].meta['my_key']
-> 'my_data'
Example 3: Produce/spawn/consume parallelism
Loosely-coupled data can be parallelized at the data item-level via the produce/spawn/consume idiom (Figure 2). To illustrate how readily this workflow pattern can be implemented in PaPy, the source code includes a generic example in doc/examples/hello_produce_spawn_consume.py. The 'hello_*' files in the doc/examples/ directory provide numerous other samples, too, including the creation of parallel pipers, the use of local grids as the target execution environment, and a highly generic workflow template.
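To convey the data flow that this idiom expresses, the following PaPy-free sketch uses plain Python generators (all names invented for illustration): a produce stage splits each coarse item into finer-grained sub-items, an intermediate stage processes the sub-items individually (the work that PaPy would replicate, or 'spawn', across parallel pipers), and a consume stage re-aggregates the results:
# Conceptual sketch only -- plain generators, no PaPy API:
def produce(items, n):
    # split each item into n equal-sized sub-items
    for item in items:
        size = len(item) // n
        for i in range(n):
            yield item[i * size:(i + 1) * size]
def process(subitems):
    # per-sub-item work (done in parallel in a real pipeline)
    for subitem in subitems:
        yield subitem.upper()
def consume(subitems, n):
    # re-join every n consecutive results into one item
    batch = []
    for subitem in subitems:
        batch.append(subitem)
        if len(batch) == n:
            yield ''.join(batch)
            batch = []
out = consume(process(produce(['acgt', 'ttaa'], 2)), 2)
print list(out) # -> ['ACGT', 'TTAA']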
Generic pipeline templates
To assist in getting started with bioinformatic pipelines, PaPy also includes a generic pipeline template (Additional File 1 §1.1; 'doc/workflows/pipeline.py') and a sample workflow that illustrates papy/nubio integration (Additional File 1 §1.2; 'doc/examples/hello_workflow.py'). The prototype pipeline includes commonly encountered workflow features, such as the branch/merge topology (sketched below). Most importantly, the example code is annotated with descriptive comments, and is written in a highly modular manner (consisting of six discrete stages, as described in Additional File 1). The latter feature contributes to clean workflow design, as it decouples those types of tasks that are logically independent of one another (e.g., definitions of worker functions, workflow topology, and compute resources need not be linked).
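As a taste of how such a topology might be declared, the following sketch connects four pipers into a branch/merge (diamond) pattern. The Plumber class and the semantics of its add_pipe method are assumed here from the PaPy documentation, and the worker functions are mere placeholders; the bundled 'doc/workflows/pipeline.py' template remains the authoritative reference:
from papy import Worker, Piper, Plumber
# Placeholder worker functions (illustrative only):
def load(input): return input[0]
def analyze_a(input): return ('A', input[0])
def analyze_b(input): return ('B', input[0])
def merge(input): return input # sees output of both branches
p_load = Piper(Worker(load))
p_a = Piper(Worker(analyze_a))
p_b = Piper(Worker(analyze_b))
p_merge = Piper(Worker(merge))
pipeline = Plumber()
# Branch: 'load' feeds both analyses; merge: both analyses
# feed the final piper (add_pipe is assumed to chain pipers):
pipeline.add_pipe((p_load, p_a, p_merge))
pipeline.add_pipe((p_load, p_b, p_merge))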
Advanced example: An intricate PaPy workflow
In protein homology modelling, potentially flexible loop regions that link more rigid secondary structural elements are often difficult to model accurately (e.g. [38]). A possible strategy to improve the predicted 3D structures of loops involves better sampling of the accessible conformational states of loop backbones, often using simulation-based approaches (e.g. [39]). Though a complete, PaPy-based implementation of loop refinement is beyond the scientific scope of this work, we include a use-case inspired by this problem for two primary reasons: (1) the workflow solution demonstrates how to integrate third-party software packages into PaPy (e.g., Stride [40] to compute loop boundaries as regions between secondary structural elements, MMTK [41] for energy calculations and simulations); and (2) loop refinement illustrates how an intricate structural bioinformatics workflow can be expressed as a PaPy pipeline. This advanced workflow demonstrates constructs such as nested functions, forked pipelines, the produce/spawn/consume idiom, iterative loops, and conditional logic. The workflow is schematized in Figure 5, and a complete description of this case study, including source code, can be found in Additional File 1 (§2 and Fig. S1, showing parallelization over loops and bounding spheres).
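As a minimal taste of the conditional logic used in such a workflow, the following hypothetical worker function (all names invented for illustration; the actual implementation is given in Additional File 1) routes each sampled loop conformation either downstream or back for another refinement iteration, based on its computed energy:
from papy import Worker
def accept_or_retry(input):
    conformation, energy = input[0]
    # conditional logic: accept low-energy conformations,
    # flag the rest for another refinement pass
    if energy < 0.0:
        return ('accept', conformation)
    return ('retry', conformation)
gate_worker = Worker(accept_or_retry)
gate_worker([('loop_model_1', -12.3)]) # -> ('accept', 'loop_model_1')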
Computational efficiency
Achieving speed-ups of workflow execution is non-trivial, as process-based parallelism involves (i) computational overhead from serialization; (ii) data transmission over potentially low-bandwidth/high-latency communication channels; (iii) process synchronization, and the associated waiting periods; and (iv) a potential bottleneck at the sole manager process (Figure 4). PaPy allows one to address these issues. Performance optimization is an activity that is largely independent of workflow construction, and may include collapsing multiple processing nodes into one (which preserves data locality and increases task granularity; Figure 1), employing direct inter-process communication (Figure 4, Table 3), adjusting the speed-up/memory trade-off parameter (e.g., NuMap's 'stride'; Figure 3), allowing for unordered flow of data and, finally, balancing the distribution of computational resources among segments of the pipeline. The PaPy documentation further addresses these intricacies, and suggests possible optimization solutions for common usage scenarios.
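To make the node-collapsing strategy concrete, the sketch below (using only constructs from Example 1) contrasts a fine-grained arrangement, in which two single-function pipers would exchange data via serialization and IPC, with a collapsed piper whose Worker composes both functions, so that the intermediate value never leaves the process; the parse/transform functions are invented for illustration:
from papy import Worker, Piper
from numap import NuMap
def parse(input): return float(input[0])
def transform(input): return input[0] ** 2
# Fine-grained alternative (two pipers; intermediate results
# are serialized and shipped between processes):
# piper_parse = Piper(Worker(parse), parallel = NuMap())
# piper_transform = Piper(Worker(transform), parallel = NuMap())
# Coarse-grained: one piper, one composed Worker; the
# intermediate float stays in-process:
collapsed = Piper(Worker((parse, transform)), \
                  parallel = NuMap())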
Further information
In addition to full descriptions of the generic PaPy pipeline template and the sample loop-refinement workflow (Additional File 1), further information is available. In particular, the documentation distributed with the source code provides extensive descriptions of both conceptual and practical aspects of workflow design and execution. Along with overviews and introductory descriptions, this thorough (≈50-page) manual includes (i) complete, step-by-step installation instructions for the Unix/Linux platform; (ii) a Quick Introduction describing PaPy's basic design, object-oriented architecture, and core components (classes), in addition to hands-on illustrations of most concepts via code snippets; (iii) an extensive presentation of parallelism-related concepts, such as maps, iterated maps, NuMap, and so on; (iv) a glossary of PaPy-related terms; and (v) because PaPy is more of a library than a program, a complete description of its application programming interface (API).
Although a thorough analysis of PaPy's relationship to existing workflow-related software solutions lies beyond the scope of this report, Additional File 1 (§4) also includes a comparative overview of PaPy, in terms of its similarities and differences to an example of a higher-level/heavyweight WMS suite (KNIME).