Argument | Description | Input |
---|---|---|
--stream | Activates the "streaming" module | N/A |
--input | File(s) to be processed by EMR | hdfs:///home/hadoop/blast_runner hdfs:///home/hadoop/ortho_runner |
--mapper | Name of mapper file | s3n://rsd_bucket/blast_mapper.py s3n://rsd_bucket/ortho_mapper.py |
--reducer | None required, reduction done within RSD algorithm | N/A |
--cache-archive | Individual symlinks to the executables, genomes, RSD_standalone, blastinput, and results folders | s3n://rsd_bucket/executables.tar.gz #executables, #genomes, #RSD_standalone, #blastinput, #results |
--output |  | hdfs:///home/hadoop/outl |
--jobconf mapred.map.tasks | Number of blast and ortholog calculation processes | = N |
--jobconf mapred.tasktracker.map.tasks.maximum | Maximum number of map tasks run simultaneously by each task tracker | = 8 |
--jobconf mapred.task.timeout | Time at which a process was considered a failure and restarted | = 86400000 ms |
--jobconf mapred.tasktracker.expiry.interval | Time at which an instance was declared dead | = 3600000 ms (set large to avoid instance shutdown during long-running jobs) |
--jobconf mapred.map.tasks.speculative.execution | If true, EMR will speculate that a job is running slow and run the same job in parallel | False (because the time for each genome-vs-genome run varied widely, we elected to set this argument to False to ensure maximal availability of the cluster) |
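
As a rough illustration of how the arguments in the table fit together, the sketch below assembles them into a flat argument list for one streaming step. It is only a sketch under stated assumptions: the helper name `build_streaming_args` and the constant `HOUR_MS` are hypothetical, the `archive#symlink` form for `--cache-archive` is assumed (only the first symlink from the table is shown), and the list is printed rather than passed to any particular client. The two timeout values are written as hour multiples to make clear that 86400000 ms is 24 hours and 3600000 ms is 1 hour.

```python
HOUR_MS = 60 * 60 * 1000  # milliseconds in one hour


def build_streaming_args(input_path, mapper, n_map_tasks):
    """Hypothetical helper: return the table's arguments as a list of strings."""
    # Values taken from the table; N (n_map_tasks) is chosen per run.
    jobconf = {
        "mapred.map.tasks": n_map_tasks,
        "mapred.tasktracker.map.tasks.maximum": 8,
        "mapred.task.timeout": 24 * HOUR_MS,                # 86400000 ms
        "mapred.tasktracker.expiry.interval": 1 * HOUR_MS,  # 3600000 ms
        "mapred.map.tasks.speculative.execution": "false",
    }
    args = [
        "--stream",
        "--input", input_path,
        "--mapper", mapper,
        # Assumed archive#symlink form; the table lists further symlinks.
        "--cache-archive", "s3n://rsd_bucket/executables.tar.gz#executables",
        "--output", "hdfs:///home/hadoop/outl",
    ]
    # No reducer argument is added: the table states none is required.
    for key, value in jobconf.items():
        args += ["--jobconf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Example for the blast step, with an arbitrary N = 4.
    blast_args = build_streaming_args(
        "hdfs:///home/hadoop/blast_runner",
        "s3n://rsd_bucket/blast_mapper.py",
        4,
    )
    print(" ".join(blast_args))
```

The same helper could be called a second time with the `ortho_runner` input and `ortho_mapper.py` mapper from the table to describe the ortholog step.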