Table 1 Elastic Map Reduce commands

From: Cloud computing for comparative genomics

| Argument | Description | Input |
| --- | --- | --- |
| --stream | Activates the "streaming" module | N/A |
| --input | File(s) to be processed by EMR | hdfs:///home/hadoop/blast_runner; hdfs:///home/hadoop/ortho_runner |
| --mapper | Name of mapper file | s3n://rsd_bucket/blast_mapper.py; s3n://rsd_bucket/ortho_mapper.py |
| --reducer | None required; reduction is done within the RSD algorithm | N/A |
| --cache-archive | Individual symlinks to the executables and genomes | s3n://rsd_bucket/executables.tar.gz #executables, #genomes, #RSD_standalone, #blastinput, #results |
| --output | Output location | hdfs:///home/hadoop/outl |
| --jobconf mapred.map.tasks | Number of BLAST and ortholog calculation processes | N |
| --jobconf mapred.tasktracker.map.tasks.maximum | Total number of task trackers | 8 |
| --jobconf mapred.task.timeout | Time after which a process is considered a failure and restarted | 86400000 ms |
| --jobconf mapred.tasktracker.expiry.interval | Time after which an instance is declared dead | 3600000 ms (set to be large to avoid instance shutdown with long-running jobs) |
| --jobconf mapred.map.tasks.speculative.execution | If true, EMR speculates that a slow-running job is failing and runs the same job in parallel | False (because the time for each genome-vs-genome run varied widely, we set this argument to False to ensure maximal availability of the cluster) |
  1. Specific commands passed through the Ruby command line client to the Elastic MapReduce program (EMR) from Amazon Web Services. The inputs specified correspond to (1) the BLAST step and (2) the ortholog computation step of the RSD cloud algorithm. These configuration settings correspond to both the EMR and Hadoop frameworks, with two exceptions: In EMR, a --j parameter can be used to provide an identifier for the entire cluster, useful only in cases where more than one cloud cluster is needed simultaneously. In Hadoop, these commands are passed directly to the streaming.jar program, obviating the need for the --stream argument.
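As a sketch, the BLAST-step arguments from the table can be assembled into a single invocation of the EMR Ruby client. This is illustrative only: the `elastic-mapreduce --create --stream` form and the single `#executables` symlink shown here are assumptions, and the job-flow name is hypothetical; the paths, bucket names, and `mapred.*` values are those listed in the table (with `mapred.map.tasks` left as the placeholder N from the table).

```shell
# Hypothetical EMR Ruby client invocation for the BLAST step (Table 1).
# --j / job naming and the exact --create flags are assumptions, not from the table.
elastic-mapreduce --create --stream \
  --input  hdfs:///home/hadoop/blast_runner \
  --mapper s3n://rsd_bucket/blast_mapper.py \
  --cache-archive s3n://rsd_bucket/executables.tar.gz#executables \
  --output hdfs:///home/hadoop/outl \
  --jobconf mapred.map.tasks=N \
  --jobconf mapred.tasktracker.map.tasks.maximum=8 \
  --jobconf mapred.task.timeout=86400000 \
  --jobconf mapred.tasktracker.expiry.interval=3600000 \
  --jobconf mapred.map.tasks.speculative.execution=false
```

The ortholog computation step would use the same form, substituting `ortho_runner` for the input and `ortho_mapper.py` for the mapper; no `--reducer` is given because reduction happens inside the RSD algorithm itself.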