Experimenting with database segmentation size vs time performance for mpiBLAST on an IBM HS21 blade cluster
© Jaromczyk et al; licensee BioMed Central Ltd. 2010
Published: 23 July 2010
Large-scale genomic projects such as the Epichloë festucae Genome Project require regular use of bioinformatic tools. When using BLAST in conjunction with larger databases, processing complex sequences often uses substantial computation time. Parallelization is considered a standard method of curbing extensive computing requirements and parallel implementations of BLAST, such as mpiBLAST, are freely available.
Materials and methods
In this experiment, the implementation segments a database into smaller databases so that BLAST queries can be more effectively performed in parallel on smaller database segments. Since there are overhead costs from distributing tasks and merging the results from each parallel run, we investigate how the usefulness of database segmentation changes as the size and the number of the database segments change. When segmentation curbs time-performance, we ask the question: "How many segments will yield the best performance or will adding processors always help?" Specifically, we consider three different times: a one-time preprocessing (segmentation of database), queue wait-time, and CPU-time. We conducted experiments to monitor time-performance as the number of database segments vary on an IBM HS21 blade cluster running mpiBLAST against fungal protein sequences from the Epichloë festucae Genome Project. The cluster has 340 computer nodes (1,360 cores, 12.8 Teraflops) whose resources are shared with other researchers and are controlled through the SLURM batch-job resource-manager and scheduled through the Moab batch-job scheduler.
Results and conclusion
This article is published under license to BioMed Central Ltd.