Table 1 Elastic Map Reduce commands

From: Cloud computing for comparative genomics

| Argument | Description | Input |
| --- | --- | --- |
| --stream | Activates the "streaming" module | N/A |
| --input | File(s) to be processed by EMR | hdfs:///home/hadoop/blast_runner; hdfs:///home/hadoop/ortho_runner |
| --mapper | Name of mapper file | s3n://rsd_bucket/blast_mapper.py; s3n://rsd_bucket/ortho_mapper.py |
| --reducer | None required; reduction is done within the RSD algorithm | N/A |
| --cache-archive | Individual symlinks to the executables and genomes | s3n://rsd_bucket/executables.tar.gz #executables, #genomes, #RSD_standalone, #blastinput, #results |
| --output | Output location | hdfs:///home/hadoop/outl |
| --jobconf mapred.map.tasks | Number of BLAST and ortholog calculation processes | N |
| --jobconf mapred.tasktracker.map.tasks.maximum | Total number of task trackers | 8 |
| --jobconf mapred.task.timeout | Time after which a process is considered a failure and restarted | 86400000 ms |
| --jobconf mapred.tasktracker.expiry.interval | Time after which an instance is declared dead | 3600000 ms (set to be large to avoid instance shutdown with long-running jobs) |
| --jobconf mapred.map.tasks.speculative.execution | If true, EMR speculates that a slow-running job is failing and runs the same job in parallel | False (because the time for each genome-vs-genome run varied widely, we set this argument to False to ensure maximal availability of the cluster) |
  1. Specific commands passed through the Ruby command line client to the Elastic MapReduce program (EMR) from Amazon Web Services. The inputs specified correspond to (1) the BLAST step and (2) the ortholog computation step of the RSD cloud algorithm. These configuration settings correspond to both the EMR and Hadoop frameworks, with two exceptions: In EMR, a --j parameter can be used to provide an identifier for the entire cluster, useful only in cases where more than one cloud cluster is needed simultaneously. In Hadoop, these commands are passed directly to the streaming.jar program, obviating the need for the --stream argument.
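As a sketch, the BLAST-step arguments from the table can be assembled into a single invocation of the EMR Ruby client. This is illustrative only: the `elastic-mapreduce --create --stream` form and the single `#executables` symlink shown here are assumptions, and the job-flow name is hypothetical; the paths, bucket names, and `mapred.*` values are those listed in the table (with `mapred.map.tasks` left as the placeholder N from the table).

```shell
# Hypothetical EMR Ruby client invocation for the BLAST step (Table 1).
# --j / job naming and the exact --create flags are assumptions, not from the table.
elastic-mapreduce --create --stream \
  --input  hdfs:///home/hadoop/blast_runner \
  --mapper s3n://rsd_bucket/blast_mapper.py \
  --cache-archive s3n://rsd_bucket/executables.tar.gz#executables \
  --output hdfs:///home/hadoop/outl \
  --jobconf mapred.map.tasks=N \
  --jobconf mapred.tasktracker.map.tasks.maximum=8 \
  --jobconf mapred.task.timeout=86400000 \
  --jobconf mapred.tasktracker.expiry.interval=3600000 \
  --jobconf mapred.map.tasks.speculative.execution=false
```

The ortholog computation step would use the same form, substituting `ortho_runner` for the input and `ortho_mapper.py` for the mapper; no `--reducer` is given because reduction happens inside the RSD algorithm itself.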