Instantiating and controlling a cloud cluster
The Galaxy CloudMan application currently supports creation of a compute cluster on Amazon’s EC2 [9] cloud computing infrastructure. Instantiating a cluster requires no prior computational experience and no compute infrastructure or software beyond the web browser used to control the cluster. Galaxy CloudMan is thus well suited to independent researchers and small labs that have a specific or periodic need for computational resources but lack the informatics expertise or the commitment needed to manage and maintain a computational cluster. Instantiating a CloudMan compute cluster consists of three steps: (1) create an Amazon Web Services (AWS) account and sign up for the EC2 and S3 services, (2) use the AWS Management Console to start a master EC2 instance, and (3) use the CloudMan web console on the master instance to manage the cluster size. Step one needs to be performed only once, usually by the person controlling the cloud cluster. Steps two and three need to be performed each time a cluster is needed to run jobs, but, again, only by the person controlling the cluster. Once the cluster is set up, additional users may use it simply through the Galaxy web interface, without requiring any system accounts or privileges. A single instance of CloudMan controls a single cluster (of potentially variable size), but a single user may create as many CloudMan clusters as desired.
Once CloudMan starts, it automatically configures the master instance as the head node of a Sun Grid Engine (SGE) [10] compute cluster, but it does not start any additional worker instances or assign persistent storage to the cluster. In cloud computing, compute instances are usually transient, meaning that any changes made to an instance while it is running are lost when the instance terminates. To persist data uploaded to the cloud, or any analysis results, the data must be stored on an external data volume. In the case of CloudMan on EC2, Amazon’s Elastic Block Store (EBS) [11] volumes are used for data persistence.
Once available, the CloudMan web interface (Figure 1) allows a user to configure additional features of the cluster. Currently, the following features are supported: association of a persistent data volume with the cluster, addition of a range of NGS tools (see below), and addition of the Galaxy analysis interface. Without a persistent data volume, a user may use the cluster for a proof-of-concept computation or a one-time analysis. For clusters that are maintained over time, adding persistent storage is initiated with a single mouse click, with all infrastructure intricacies handled automatically by CloudMan. Similarly, if a completely configured instance of Galaxy is desired for use with a range of NGS tools, it can be added through the CloudMan UI.
In addition to the user-level cluster functionality, CloudMan makes it easy to exploit what is arguably the most distinctive and powerful feature of cloud computing: elasticity. Through the CloudMan web interface, one can scale the size of the cloud cluster at runtime by adding or removing the worker instances that make up the cluster (Figure 2). Similarly, the size of the persistent data volume (i.e., the EBS volume) associated with a cluster can easily be expanded. Within EC2, the individual EBS volumes that CloudMan uses as its persistent data storage medium have a predefined size. As the use of a given cluster expands, users may exhaust the space associated with that cluster. The CloudMan web interface therefore allows ‘growing’ the persistent data volume associated with a cluster. In the background, CloudMan orchestrates the following steps: (1) stop any services using the user data volume, (2) detach the current user data volume from the master instance, (3) snapshot the detached volume, (4) create a new volume of the user-specified size based on the snapshot from step 3 and attach it to the master instance, (5) grow the file system on the new data volume, and (6) resume any services.
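The ordering of the six steps can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the names Volume, Master, and grow_data_volume are hypothetical and are not part of CloudMan or of any AWS library, and no real cloud calls are made. It shows why the file system must be grown separately (a volume created from a snapshot inherits the old file system size) and why services are stopped before the volume is detached.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Volume:
    """Hypothetical stand-in for an EBS volume; sizes are in GB."""
    size_gb: int
    fs_size_gb: int          # size of the file system stored on the volume
    attached: bool = False


@dataclass
class Master:
    """Hypothetical stand-in for the master instance and its services."""
    volume: Optional[Volume]
    services_running: bool = True
    log: List[str] = field(default_factory=list)


def grow_data_volume(master: Master, new_size_gb: int) -> Volume:
    """Replay the six steps from the text against the in-memory model."""
    old = master.volume
    if old is None:
        raise ValueError("no data volume is attached")
    if new_size_gb <= old.size_gb:
        raise ValueError("new size must exceed the current size")

    # (1) Stop services so nothing writes to the volume during the swap.
    master.services_running = False
    master.log.append("stop services")

    # (2) Detach the current data volume from the master instance.
    old.attached = False
    master.volume = None
    master.log.append("detach volume")

    # (3) Snapshot the detached volume; the snapshot keeps the old
    #     file system size.
    snapshot_fs_gb = old.fs_size_gb
    master.log.append("snapshot volume")

    # (4) Create a larger volume from the snapshot and attach it.
    new = Volume(size_gb=new_size_gb, fs_size_gb=snapshot_fs_gb, attached=True)
    master.volume = new
    master.log.append("create and attach volume")

    # (5) Grow the file system so it fills the larger volume.
    new.fs_size_gb = new.size_gb
    master.log.append("grow file system")

    # (6) Resume the services that were stopped in step 1.
    master.services_running = True
    master.log.append("resume services")
    return new


if __name__ == "__main__":
    master = Master(volume=Volume(size_gb=100, fs_size_gb=100, attached=True))
    grown = grow_data_volume(master, 250)
    print(grown.size_gb, grown.fs_size_gb, master.services_running)
```

In this model, skipping step 5 would leave a 250 GB volume that still exposes only 100 GB of usable file system.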
Because Galaxy CloudMan is built on top of a Bio-Linux machine image, all of the tools available within Bio-Linux can be used on the instantiated cloud cluster. The Bio-Linux tools are accessed through a command line interface, just as on any other compute cluster. As indicated earlier, the SGE job manager is configured and used on the cluster, so users can simply copy their existing job scripts to the cloud cluster and run them there, with the added scalability offered by cloud computing.
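As an illustration of what such a job script might look like, the following sketch generates a minimal SGE script as text. The helper name make_sge_script, the job name, the log file, and the Bowtie index and read file names are all hypothetical; the embedded lines beginning with #$ use standard qsub options (-N job name, -S shell, -cwd run in the submission directory, -j y merge stderr into stdout, -o output file).

```python
import shlex
from typing import Sequence


def make_sge_script(job_name: str, command: Sequence[str], log_file: str) -> str:
    """Return the text of a minimal SGE job script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",    # job name shown by the scheduler
        "#$ -S /bin/bash",      # shell used to interpret the script
        "#$ -cwd",              # run in the directory the job was submitted from
        "#$ -j y",              # merge stderr into stdout
        f"#$ -o {log_file}",    # file that receives the job output
        # Quote each argument so spaces in file names survive the shell.
        " ".join(shlex.quote(arg) for arg in command),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Index and file names below are placeholders, not real data sets.
    script = make_sge_script(
        "align_reads",
        ["bowtie", "-S", "example_index", "reads.fq", "aligned.sam"],
        "align_reads.log",
    )
    print(script, end="")
```

Saved to a file on the master instance, a script of this form would typically be submitted with qsub, after which SGE dispatches it to an available worker.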
When a given cluster is no longer needed, the CloudMan web interface is used to terminate all of the services and worker instances. If persistent data storage was associated with the cluster, the data is preserved while the cluster is offline and is made available in the same state once the cluster is instantiated again. Scaling a cluster up or down to consume the required amount of resources takes only a few minutes.
Tool availability
By default, Galaxy CloudMan is built on top of a Bio-Linux machine image available from CloudBioLinux [12] and thus makes all of the tools packaged by NERC Bio-Linux [13] immediately available. NERC Bio-Linux is a set of packaged and fully featured bioinformatics tools that lets users focus on tool usage rather than tool installation and configuration. Building on top of such a varied set of bioinformatics tools combines the cluster-controlling functionality of CloudMan with the breadth of tools in Bio-Linux. In addition to the tools available through Bio-Linux, a set of NGS tools available through Galaxy is also available for use, including Bowtie [14], BWA [15], and SAMtools [16]. If a user desires additional tools, we have provided a mechanism for streamlining the tool installation process (see Methods section). A script used to automatically install all the tools available to a default instance of a CloudMan cluster is available at https://bitbucket.org/afgane/mi-deployment/; customizing this script to include the desired tools provides a simple method for modifying the capabilities of a cluster instance. The script can install additional tools for the current cluster runtime only, or persist the changes for future cluster invocations.