Volume 11 Supplement 12
Galaxy CloudMan: delivering cloud compute clusters
© Afgan et al; licensee BioMed Central Ltd. 2010
Published: 21 December 2010
Widespread adoption of high-throughput sequencing has greatly increased the scale and sophistication of computational infrastructure needed to perform genomic research. An alternative to building and maintaining local infrastructure is “cloud computing”, which, in principle, offers on demand access to flexible computational infrastructure. However, cloud computing resources are not yet suitable for immediate “as is” use by experimental biologists.
We present a cloud resource management system that makes it possible for individual researchers to compose and control an arbitrarily sized compute cluster on Amazon’s EC2 cloud infrastructure without any informatics requirements. Within this system, an entire suite of biological tools packaged by the NERC Bio-Linux team (http://nebc.nerc.ac.uk/tools/bio-linux) is available for immediate consumption. The provided solution makes it possible, using only a web browser, to create a completely configured compute cluster ready to perform analysis in less than five minutes. Moreover, we provide an automated method for building custom deployments of cloud resources. This approach promotes reproducibility of results and, if desired, allows individuals and labs to add or customize an otherwise available cloud system to better meet their needs.
The expected knowledge and associated effort with deploying a compute cluster in the Amazon EC2 cloud is not trivial. The solution presented in this paper eliminates these barriers, making it possible for researchers to deploy exactly the amount of computing power they need, combined with a wealth of existing analysis software, to handle the ongoing data deluge.
Widespread availability of high-throughput DNA sequencing instruments, and the development of novel techniques based on sequencing, provides a potentially very valuable resource for researchers in genomics. However, transforming sequence data into biologically meaningful information requires sophisticated computational infrastructure and support. The size of the required computational infrastructure is outpacing what many labs and even universities are able to support. In addition, the setup and maintenance associated with a computational infrastructure presents significant problems for individual investigators and small labs that may not have the necessary informatics support.
Fortunately, a computational model – cloud computing  – has recently emerged and is ideally suited for the analysis of large-scale sequence data. In this model, computation and storage exist as virtual resources in remote datacenters, and can be dynamically allocated and released as needed. However, cloud computing resources are not yet suitable for immediate “as is” use by experimental biologists. In the current model, cloud resources are acquired as independent, stripped-down units that must first be customized for the intended use. They then must be configured to work in unison and a mechanism must be provided for the data uploaded and analyzed on those resources to persist beyond the life of a (usually transient) cloud compute resource.
To date, there are several projects and solutions in the context of high throughput sequencing and bioinformatics in general that utilize cloud computing to deliver the computational capacity and on-demand scalability (e.g., Crossbow , CloudBurst , RSD-cloud , Myrna ). However, these projects primarily target specific problems and provide custom solutions for a given tool or methodology. Although very valuable, such tools provide minimal help for researchers wanting to compose a variety of tools into an analysis pipeline or for researchers that want to use their own tools when utilizing cloud computing resources. Thus, there is a need for a flexible solution that can be utilized in a variety of scenarios and that provides support for customization.
This paper introduces CloudMan from the Galaxy Project , an integrated solution that leverages existing tools and packages by providing a generic method for utilizing those tools on cloud resources and abstracting out low-level informatics details. This solution handles all of the intricacies of cloud computing resource acquisition, configuration, and scaling to deliver a personal compute cluster in a matter of minutes. All interaction with Galaxy CloudMan and the associated cloud cluster management is performed through a web based user interface and requires no computational expertise. The deployed cluster comes preconfigured with all of the bioinformatics packages available in the NERC Bio-Linux workstation (version 6)  as well as a range of NGS tools available within the Galaxy framework . Moreover, the process of tool deployment is fully automated and decoupled from the base machine image, making it possible to very simply add additional tools to an individual cloud cluster instantiation.
Instantiating and controlling a cloud cluster
The Galaxy CloudMan application currently supports creation of a compute cluster on Amazon’s EC2  cloud computing infrastructure. The process of instantiating a cluster does not require any computational experience, and requires no compute infrastructure or software beyond the web browser used to control the cluster. Galaxy CloudMan is thus ideal for independent researchers and small labs that have a specific or periodic need for computational resources but lack informatics expertise and commitment to manage and maintain a computational cluster. The process of instantiating a CloudMan compute cluster consists of three steps: (1) create an Amazon Web Services (AWS) account and sign up for the EC2 and S3 services, (2) use the AWS Management Console to start a master EC2 instance, and (3) use the CloudMan web console on the master instance to manage the cluster size. Step one needs to be performed only once, usually by a person controlling the cloud cluster. Steps two and three need to be performed each time running jobs on a compute cluster is desired, but, again, only by the person controlling the cluster. Once set up, additional users may use the cluster simply through the Galaxy web interface without requiring any system accounts or privileges. A single instance of CloudMan controls a single cluster – of potentially variable size – but a single user may create as many CloudMan cluster instances as desired.
Once CloudMan starts, it automatically configures the master instance as a head node of a Sun Grid Engine (SGE)  compute cluster but it does not start any additional worker instances or assign persistent storage to the cluster. In the context of cloud computing, compute instances are usually transient, meaning that any changes made to an instance while the instance is alive are lost at instance termination. In order to persist any data uploaded to the cloud or any analysis results, the data needs to be stored on an external data volume. In the case of CloudMan on EC2, Amazon’s Elastic Block Storage (EBS)  volumes are used for data persistence.
Because Galaxy CloudMan is built on top of a Bio-Linux machine image, all of the tools available within Bio-Linux can be used on the instantiated cloud cluster. Accessing the Bio-Linux tools is realized through a command line interface - just like on any other compute cluster. As indicated earlier, the SGE job manager is configured and used on the cluster, making it possible for users to simply copy their job scripts to the cloud cluster and run them there - but with the scalability offered through cloud computing.
When a given cluster is no longer needed, the CloudMan web interface is used to terminate all of the services and worker instances. If persistent data storage was associated with the cluster, the data is preserved while the cluster is offline, and made available in the same state once the cluster is instantiated again. It takes only a few minutes to scale up or down a cluster and consume the required amount of resources.
By default, the Galaxy CloudMan is built on top of a Bio-Linux machine image available from CloudBioLinux  and thus makes all of the tools packaged by NERC Bio-Linux  immediately available. NERC Bio-Linux represents a set of packaged and fully featured bioinformatics tools that enable users to focus on tool usage rather than tool installation and configuration. By building on top of such varied set of bioinformatics tools, one can combine the cluster controlling functionality of CloudMan with the variety of tools. In addition to the tools available through Bio-Linux, a set of NGS tools available through Galaxy are also available for use, including: Bowtie , BWA , and SAMtools . If a user desires additional tools, we have provided a mechanism for streamlining the tool installation process (see Methods section). A script used to automatically install all the tools available to a default instance of CloudMan cluster is available at https://bitbucket.org/afgane/mi-deployment/; using this script and customizing it to include the desired tools provides a simple method for modifying the capabilities of a cluster instance. The script supports the ability to install additional tools at cluster runtime only or to persist the changes for future cluster invocations.
To keep up with the growth of data being produced in life sciences and the accompanying computational demand, there is a need for increased access to computational resources. Cloud computing offers access to such resources but still makes it difficult to create complex deployments of useful standalone infrastructures. This is especially cumbersome for individuals and small labs that lack informatics support to fully harness this general-purpose infrastructure.
The default version of Galaxy CloudMan can be used “as-is” to support creation and control of fully functional compute clusters on cloud resources; it supports a broad range of bioinformatics tools and it makes it possible to add additional tools with little effort. The process of tool deployment is completely automated and well documented, making it reproducible in other environments. Overall, the system is simple to use and it targets individual researchers so they can gain access to the computational resources they need without requiring support from skilled bioinformatics personnel.
The source code for the entire project is available under the MIT licence and is available from http://bitbucket.org/galaxy/cloudman/. Documentation and detailed instructions on how to use Galaxy CloudMan are also available from http://usegalaxy.org/cloud.
To deliver a ready-to-use compute infrastructure to individual researchers, we decided to build CloudMan as a modular framework that can be used “as is” but also supports easy customization. In keeping with this goal, we have embedded Galaxy CloudMan on top of the Bio-Linux workstation machine image and integrated it with Galaxy. This delivers a broad range of bioinformatics tools directly to CloudMan users.
By having domain specific tools on external data volumes, it is easy to modify the content of those volumes, and thus the availability of underlying tools, without the need to modify and rebuild the machine image. This is convenient for end users as they can focus on using the tools while the CloudMan developers can focus on ensuring the infrastructure behaves as desired. Specifically, in order to update a tools data volume, a user instantiates a cloud cluster, modifies the contents of the tools data volume by installing the desired tools (performed just like on any other machine and also showcased by our tool deployment script), creates a snapshot of the modified data volume (done automatically through the web UI), and points their cluster setup to use the newly created data volume snapshot (done manually through a web UI). As a result, the functionality and the architecture of CloudMan can easily be adjusted to the computing needs of individual researchers and different domains. This makes it possible for end users to add their own tools to a CloudMan cluster instance.
List of abbreviations used
Amazon Web Services
Elastic Block Storage
Elastic Compute Cloud
Simple Storage Service
Sun Grid Engine
Galaxy is developed by the Galaxy Team: Enis Afgan, Guruprasad Ananda, Dannon Baker, Dan Blankenberg, Ramkrishna Chakrabarty, Nate Coraor, Jeremy Goecks, Greg Von Kuster, Ross Lazarus, Kanwei Li, Anton Nekrutenko, James Taylor, and Kelly Vincent. We thank our many collaborators who support and maintain data warehouses and browsers accessible through Galaxy. We would also like to thank the participants of BOSC Codefest 2010, and the Bio-Linux community. This work was supported by NIH grant HG005542 (J.T. and A.N.). Development of the Galaxy framework is also supported by NIH grants HG004909 (A.N. and J.T), HG005133 (J.T. and A.N), by NSF grant DBI-0850103 (A.N. and J.T) and by funds from the Huck Institutes for the Life Sciences and the Institute for CyberScience at Penn State. Additional funding is provided, in part, under a grant with the Pennsylvania Department of Health using Tobacco Settlement Funds. The Department specifically disclaims responsibility for any analyses, interpretations or conclusions.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 12, 2010: Proceedings of the 11th Annual Bioinformatics Open Source Conference (BOSC) 2010. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S12.
- Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I, Zaharia M: Above the Clouds: A Berkeley View of Cloud Computing. In Book Above the Clouds: A Berkeley View of Cloud Computing. Edited by: Editor ed.^eds.. University of California at Berkeley; 2009:23.Google Scholar
- Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL: Searching for SNPs with cloud computing. Genome Biol 2009, 10: R134. 10.1186/gb-2009-10-11-r134PubMed CentralView ArticlePubMedGoogle Scholar
- Schatz MC: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 2009, 25: 1363–1369. 10.1093/bioinformatics/btp236PubMed CentralView ArticlePubMedGoogle Scholar
- Wall DP, Kudtarkar P, Fusaro VA, Pivovarov R, Patil P, Tonellato PJ: Cloud computing for comparative genomics. BMC Bioinformatics 2010, 11: 259.PubMed CentralView ArticlePubMedGoogle Scholar
- Schatz MC, Langmead B, Salzberg SL: Cloud computing and the DNA data race. Nat Biotechnol 2010, 28: 691–693. 10.1038/nbt0710-691PubMed CentralView ArticlePubMedGoogle Scholar
- The Galaxy Project[http://galaxyproject.org/]
- Field D, Tiwari B, Booth T, Houten S, Swan D, Bertrand N, Thurston M: Open software for biologists: from famine to feast. Nat Biotechnol 2006, 24: 801–803. 10.1038/nbt0706-801View ArticlePubMedGoogle Scholar
- Taylor J, Schenck I, Blankenberg D, Nekrutenko A: Using Galaxy to perform large-scale interactive data analyses. Current Protocols in Bioinformatics 2007, 19: 10.15.11–10.15.25.Google Scholar
- Amazon Elastic Compute Cloud (Amazon EC2)[http://aws.amazon.com/ec2/]
- Grid Engine[http://gridengine.sunsource.net/]
- Amazon Elastic Block Store (EBS)[https://aws.amazon.com/ebs/]
- Cloud Biolinux[http://www.cloudbiolinux.com/]
- Bio-Linux 6[http://nebc.nerc.ac.uk/tools/bio-linux/bio-linux-6.0]
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009, 10: R25. 10.1186/gb-2009-10-3-r25PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25: 1754–1760. 10.1093/bioinformatics/btp324PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25: 2078–2079. 10.1093/bioinformatics/btp352PubMed CentralView ArticlePubMedGoogle Scholar
- Keahey K, Freeman T: Contextualization: Providing one-click virtual clusters. In IEEE International Conference on eScience. Indianapolis, IN; 2008:301–308. full_textGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.