Cloudgene: A graphical execution platform for MapReduce programs on private and public clouds
- Sebastian Schönherr†1, 2,
- Lukas Forer†1, 2,
- Hansi Weißensteiner1, 2,
- Florian Kronenberg1,
- Günther Specht2 and
- Anita Kloss-Brandstätter1Email author
© Schönherr et al.; licensee BioMed Central Ltd. 2012
Received: 15 May 2012
Accepted: 1 August 2012
Published: 13 August 2012
The MapReduce framework enables a scalable processing and analyzing of large datasets by distributing the computational load on connected computer nodes, referred to as a cluster. In Bioinformatics, MapReduce has already been adopted to various case scenarios such as mapping next generation sequencing data to a reference genome, finding SNPs from short read data or matching strings in genotype files. Nevertheless, tasks like installing and maintaining MapReduce on a cluster system, importing data into its distributed file system or executing MapReduce programs require advanced knowledge in computer science and could thus prevent scientists from usage of currently available and useful software solutions.
Here we present Cloudgene, a freely available platform to improve the usability of MapReduce programs in Bioinformatics by providing a graphical user interface for the execution, the import and export of data and the reproducibility of workflows on in-house (private clouds) and rented clusters (public clouds). The aim of Cloudgene is to build a standardized graphical execution environment for currently available and future MapReduce programs, which can all be integrated by using its plug-in interface. Since Cloudgene can be executed on private clusters, sensitive datasets can be kept in house at all time and data transfer times are therefore minimized.
Our results show that MapReduce programs can be integrated into Cloudgene with little effort and without adding any computational overhead to existing programs. This platform gives developers the opportunity to focus on the actual implementation task and provides scientists a platform with the aim to hide the complexity of MapReduce. In addition to MapReduce programs, Cloudgene can also be used to launch predefined systems (e.g. Cloud BioLinux, RStudio) in public clouds. Currently, five different bioinformatic programs using MapReduce and two systems are integrated and have been successfully deployed. Cloudgene is freely available athttp://cloudgene.uibk.ac.at.
Computer science is becoming increasingly important in today’s genetic research. The accelerated progress in molecular biological technologies puts increasing demands on adequate software solutions. This is especially true for next generation sequencing (NGS) where costs are falling faster than for computer hardware. As a consequence, the accompanying growth of data results in longer execution times of currently available programs and requires new strategies to process data efficiently. The MapReduce framework and especially its open-source implementation Hadoop has become more and more popular for processing and analyzing terabytes of data: Mapping NGS data to the human genome, calculating differential gene expression in RNA-seq datasets or even simpler but time intensive tasks like matching strings in large genotype files1 are already successfully implemented scenarios. With MapReduce, a computation is distributed and executed in parallel over all computer nodes in a cluster, allowing to add or to remove nodes on demand (scale-out principle). The developer is responsible to write the corresponding map and reduce task and the framework itself is taking over the parallelization, fault tolerance of hardware and software, replication and I/O scheduling. Unfortunately, small to medium sized genetic research institutes can often hardly afford the acquirement and maintenance of own computer clusters. An alternative is public cloud computing which offers the possibility to rent computer hardware from different providers like Amazon’s Elastic Compute Cloud (http://aws.amazon.com/ec2/) on demand.
Working with cluster architectures requires a background in computer science for both setting up a cluster infrastructure and executing MapReduce programs, where sometimes a graphical user interface (GUI) is lacking at all. To improve usability, different programs[4–6] have been developed with a focus on a simplified execution. This constitutes a major improvement for scientists with the down side to implement a new GUI for every future MapReduce program. Additionally, concatenating different programs to a pipeline is still hampered.
In this paper we present Cloudgene, a platform to integrate available MapReduce programs via manifest files and to facilitate the use of on-demand cluster architectures in cloud environments. Cloudgene’s biggest advantage lies in simplifying the import and export of data, execution and monitoring of MapReduce programs on in-house (private clouds) or rented clusters (public clouds) and allowing the reproducibility of an analysis or analysis pipeline.
Architecture and technologies
Before launching a new job, data needs to be imported into the distributed file system (Hadoop Distributed File System), whereby the data source has to be selected. Currently, Cloudgene supports a data import from FTP, HTTP, Amazon S3 buckets or direct file uploads. A job can be submitted by specifying the previously imported data and the program-specific parameters. After launching a program, the process can be monitored and all jobs including results are viewable or downloadable. As all data on a cluster in a public cloud are lost on shutdown, Cloudgene automatically exports all results, log data and if specified also imported datasets in parallel to an S3 bucket.
Currently integrated programs
Highly sensitive short read mapping with MapReduce.
A cloud computing tool for calculating differential gene expression in large RNA-seq datasets.
A scalable software pipeline for whole genome re-sequencing analysis.
Quality control for high throughput sequence data in fastq format.
Filters and extracts certain SNPs from genome wide association studies datasets.
Several Hadoop example applications including Sort and Grep.
An AWS-EC2 image that includes a wide range of biological software, programming libraries as well as data sets.
An AWS-EC2 image that enables the usage of all R programming tools via a web interface.
A web application to determine mitochondrial DNA haplogroups.
Integrating existing MapReduce programs
Wall-Times: Cloudgene compared with Amazon Elastic Map-Reduce executing CloudBurst with input data from s3n://elasticmapreduce/samples/cloudburst
2 + 1
4 + 1
8 + 1
Import Data Times
Integrating novel MapReduce programs
For a proof of concept, we implemented a MapReduce pipeline for FastQ pre-processing to check the quality of large NGS datasets similar to the FastQC tool2, in which statistics like the sequence quality, base quality, length distribution, and sequence duplication levels are calculated. Unlike FastQC, which can be executed on a single computer and where due to memory requirements only a sequence subset is used for quality control computations, our implementation has the advantage that the complete set of sequences is included into the statistical analysis. After processing all sequences, a consecutive program transforms the results into meaningful plots and can finally be downloaded as pdf-files. Again, this shows the step functionality of Cloudgene-MapRed in which different programs can be connected to a pipeline (i.e. step 1: calculating statistics, step 2: generate plot). Moreover, fastq input-files can be imported from public Amazon S3 buckets (e.g. raw data from the 1000 genomes public S3 bucket), which is especially useful in combination with Amazon EC2 since data transfer from S3 to EC2 nodes is optimized (see Table2 for measurements of import times).
A further scenario for a simple but time-intensive task is the filtering and extracting of certain rows from a large SNP-genotype file as usually used in genome-wide association studies with a file size of several gigabytes. Again, this case scenario was implemented as a MapReduce job by the authors of this paper and has been integrated into Cloudgene.
All integrated and introduced programs are available for download on our webpage or can be installed from the Cloudgene repository directly.
Launching a stand-alone image
Different file system images (called AMIs in Amazon terminology) exist to provide a convenient way for setting up systems in the cloud: Cloud BioLinux is a suitable file system image in Bioinformatics including a wide range of biological software, programming libraries as well as data and is therefore an excellent basis for bioinformatic computations in the cloud. RStudio is a development environment for R, available as an EC2 image and allows the usage of all R programming tools via a web interface. Both systems have been successfully integrated into Cloudgene enabling scientists to launch these systems via Cloudgene-Cluster. Cloud BioLinux has further been used as the underlying image for the integration of Myrna and Crossbow into Cloudgene, since most of the required software is already installed and the cluster installation process is therefore simplified.
Launching a web application
Besides the mentioned MapReduce scenarios, Cloudgene can also be used to host a user defined web application on a public cloud. We demonstrated this on HaploGrep, a tool to determine mitochondrial DNA haplogroups from mtDNA profiles. Especially for large input data, HaploGrep requires a lot of main memory, thereby making a public cloud node with sufficient main memory an adequate choice. For this purpose a simple shell script was integrated into the setup process to start the HaploGrep web server. This shell script can be defined in the manifest file of Cloudgene-Cluster and is executed automatically on each cluster node.
Although MapReduce enabled a scalable way to process and analyze data, the execution of programs and the overall setup of cluster architectures still includes non-trivial tasks and hampers the spread of MapReduce programs in Bioinformatics. We therefore developed and implemented Cloudgene, which provides scientists a graphical execution platform and a standardized way to manage large-scale bioinformatic projects.
Strengths and limitations
The usage of Cloudgene has several strengths: (1) Programs can be executed via one centralized platform, thereby standardizing the import/export of data, the execution and monitoring of MapReduce jobs and the reproducibility of programs or new defined program pipelines; (2) scientists can decide flexibly in which environment (public or private cluster) a program should be executed depending on the case scenario to guarantee an appropriate level of data security and to reduce data transfer times; (3) an Amazon EC2 cluster can be launched via the Cloudgene web interface, thus simplifying the overall setup and making Cloudgene available in a public cloud; (4) new MapReduce programs can be integrated by using Cloudgene’s plug-in interface without changing the program code at all.
However, Cloudgene has limitations as well: (1) since the main focus of our approach lies on a simplified execution of MapReduce jobs, programs using other paradigms (e.g. MPI or iterative processing) are currently not supported; (2) Cloudgene allows the concatenation of jobs to simple pipelines with the limitation that pipelines have to be executed from start to end and currently does not allow the execution of specific pipeline steps automatically (e.g. restart at step 3); (3) since the amount of included cluster nodes must be set at start, the launched cluster architectures are currently static and are not changeable during runtime.
Comparison with similar software packages
To date, several approaches exist to improve the usability of currently available bioinformatic solutions. Systems such as Galaxy[14, 15], GenePattern, Ergatis, Mobyle and Taverna[19, 20] try to facilitate the creation, execution and maintainability of workflows in a fast and user-friendly way. In contrast to these existing workflow platforms, Cloudgene’s primary focus lies on the usability of MapReduce jobs in public and private clouds for bioinformatic applications.
Galaxy CloudMan is a similar approach to Cloudgene-Cluster and supports users to set up cloud clusters using Amazon EC2. It works in combination with Bio-Linux (http://nebc.nerc.ac.uk/tools/bio-linux) and configures at start up the Oracle Grid Engine as well as Galaxy. Unfortunately, CloudMan does not support MapReduce by default and therefore a graphical execution and monitoring of jobs is not possible.
Another mentionable and useful system is Amazon Elastic MapReduce (EMR), which provides the opportunity to create job-flows including custom jars, streaming or Hive/Pig programs. Since everything is located on Amazon directly, a highly optimized version of Hadoop MapReduce in combination with Amazon S3 is provided and can be executed by a comprehensive user interface. Nevertheless, Amazon Elastic MapReduce can only be used in combination with Amazon EC2, sometimes preventing research institutes from using it due to data security rules or the enormous amount of data to transfer3 from their own institutional cloud. In contrast, Cloudgene allows launching MapReduce jobs both on public or private clouds, thereby enabling the user to define the location of data. Since Cloudgene does not utilize EMR for its job execution, additionally financial costs for EMR can be saved. Table2 summarizes the comparison of Cloudgene and EMR and shows that Cloudgene is competitive regarding cluster set-up, job execution and data transfer.
CloVR is a virtual image that provides several analysis pipelines to use on a personal computer as well as on the cloud. It utilizes the Grid Engine (http://gridengine.org) for job scheduling and plans a possible future integration of Hadoop MapReduce. Eoulsan is a modular framework which enables the setup of cloud computing clusters and automates the analysis of high throughput sequencing data. A modular plug-in system allows the integration of available algorithms. Eoulsan uses EMR for the execution of their MapReduce jobs and has to be executed on the command-line. Both systems improve the usage of programs for scientists, but are not focused on a graphical execution of jobs on public and private clusters.
The success of a platform like Cloudgene goes hand in hand with the amount of involved users and scenarios. Therefore, our short-term focus will be on an extension of Cloudgene with new case scenarios, hopefully motivating users integrating their own MapReduce programs or systems. One of the biggest advantages of public clouds is the opportunity to rent as many computer nodes and thus computational power as needed. Thus, the next version of Cloudgene is conceived to provide functions for adding and removing computer nodes during runtime. Furthermore, a simple user interface for Hadoop is not only useful for end users but also for developers. It supports them during prototyping and testing of novel MapReduce programs by highlighting performance bottlenecks. Thus, we plan to implement time measurements on the map, reduce and shuffle phase and to visualize them in an intuitive chart. Additionally, Hadoop plans in its future version the support of alternate programming paradigms, which is particularly important for applications where custom frameworks outperform MapReduce by an order of magnitude (e.g. iterative applications like K-Means).
We presented Cloudgene, a platform that allows scientists to set up a user-defined cluster in the cloud and to execute or monitor MapReduce jobs via a dynamically created web interface on the cluster. Cloudgene’s aim is to integrate existing and future MapReduce programs via a manifest file into one centralized platform. Cloudgene supports users without deeper background in computer science and improves the usability of currently available MapReduce programs in the field of Bioinformatics. We think that this approach improves the utilization of programs and the reproducibility of results. Additionally, we showed on different scenarios how an integration can be fulfilled without adding overhead to the computation, thereby improving both the development and the usability of a program.
Availability and requirements
Project name: Cloudgene
Project home page: http://cloudgene.uibk.ac.at
Operating System: Cloudgene-Cluster (platform-independent), Cloudgene-MapRed (GNU/Linux)
Other requirements: Java 1.6, AWS-EC2 Account for public clouds, Hadoop MapReduce for private clouds
License: GNU GPL v3
Any restrictions to use by non-academics: None
3 Amazon Web Services (AWS) provides possibility to ship data on hard-drives.
4 In-house implementation.
SS was supported by a scholarship from the University of Innsbruck (Doctoral grant for young researchers, MIP10/2009/3). HW was supported by a scholarship from the Autonomous Province of Bozen/Bolzano (South Tyrol). The project was supported by an Amazon Research Grant, the grant ‘Aktion D. Swarovski’ and by the Österreichische Nationalbank (Grant 13059) as well as the Sequencing and Genotyping Core Facility of the Innsbruck Medical University. We thank the open source and free software community as well as the Apache Whirr Mailing list, especially Tom White and Andrei Savu for their great assistance.
- DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. http://www.genome.gov/sequencingcosts
- Dean J, Ghemawat S: MapReduce: Simplified data processing on large clusters. Commun ACM 2008, 51(1):107–113. 10.1145/1327452.1327492View Article
- Apache Hadoop;[http://hadoop.apache.org]
- Schatz MC: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 2009, 25(11):1363–1369. 10.1093/bioinformatics/btp236PubMed CentralView ArticlePubMed
- Langmead B, Hansen KD, Leek JT: Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol 2010, 11(8):R83. 10.1186/gb-2010-11-8-r83PubMed CentralView ArticlePubMed
- Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL: Searching for SNPs with cloud computing. Genome Biol 2009, 10(11):R134. 10.1186/gb-2009-10-11-r134PubMed CentralView ArticlePubMed
- Apache Whirr;[http://whirr.apache.org/]
- Amazon Elastic MapReduce;[http://aws.amazon.com/elasticmapreduce/]
- Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K: Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics 2012, 13(1):42. 10.1186/1471-2105-13-42PubMed CentralView ArticlePubMed
- RStudio AMI;[http://www.louisaslett.com/RStudio_AMI]
- Kloss-Brandstätter A, Pacher D, Schönherr S, Weissensteiner H, Binna R, Specht G, Kronenberg F: HaploGrep: a fast and reliable algorithm for automatic classification of mitochondrial DNA haplogroups. Hum Mutat 2011, 32(1):25–32. 10.1002/humu.21382View ArticlePubMed
- Goecks J, Nekrutenko A, Taylor J: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010, 11(8):R86. 10.1186/gb-2010-11-8-r86PubMed CentralView ArticlePubMed
- Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J: Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol 2010, Chapter 19(Unit 19.10.1–21):11–21.
- Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: Gene Pattern 2.0. Nat Genet 2006, 38(5):500–501. 10.1038/ng0506-500View ArticlePubMed
- Orvis J, Crabtree J, Galens K, Gussman A, Inman JM, Lee E, Nampally S, Riley D, Sundaram JP, Felix V, et al.: Ergatis: a web interface and scalable software system for bioinformatics workflows. Bioinformatics 2010, 26(12):1488–1492. 10.1093/bioinformatics/btq167PubMed CentralView ArticlePubMed
- Neron B, Menager H, Maufrais C, Joly N, Maupetit J, Letort S, Carrere S, Tuffery P, Letondal C: Mobyle: a new full web bioinformatics framework. Bioinformatics 2009, 25(22):3005–3011. 10.1093/bioinformatics/btp493PubMed CentralView ArticlePubMed
- Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic Acids Res 2006, 34(Web Server issue):W729-W732.PubMed CentralView ArticlePubMed
- Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, et al.: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 2004, 20(17):3045–3054. 10.1093/bioinformatics/bth361View ArticlePubMed
- Afgan E, Baker D, Coraor N, Chapman B, Nekrutenko A, Taylor J: Galaxy Cloud Man: delivering cloud compute clusters. BMC Bioinformatics 2010, 11(Suppl 12):S4. 10.1186/1471-2105-11-S12-S4PubMed CentralView ArticlePubMed
- Oracle Grid Engine;[http://www.oracle.com/technetwork/oem/grid-engine-166852.html]
- Angiuoli SV, Matalka M, Gussman A, Galens K, Vangala M, Riley DR, Arze C, White JR, White O, Fricke WF: CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics 2011, 12(1):356. 10.1186/1471-2105-12-356PubMed CentralView ArticlePubMed
- Jourdren L, Bernard M, Dillies MA, Le Crom S: Eoulsan: A Cloud Computing-Based Framework Facilitating High Throughput Sequencing Analyses. Bioinformatics 2012, 28(11):1542–1543. 10.1093/bioinformatics/bts165View ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.