The cost of genome sequencing has been rapidly decreasing due to the introduction of a number of affordable next-generation sequencing technologies. Coupled with the decreasing costs is an increase in the volume of data produced by sequencing machines. As a consequence, the genomics field has been changing rapidly: larger amounts of sequence data are not only being produced at lower costs, but also more and more often by small to midsize research groups outside of the large sequencing centers all over the world. This trend is likely to continue as newer-generation sequencing technologies continue to drive down costs.
High-throughput sequencing technologies have decentralized sequence acquisition, increasing the number of research groups worldwide in need of sequence analysis. The growing volume of data from next-generation sequencing has led to increased computational and bioinformatics needs and to concerns about a bioinformatics bottleneck. Technical challenges in the use of bioinformatics software [3, 4] and difficulties in utilizing available computational resources [5, 6] impede the analysis, interpretation, and full exploration of sequence data.
Although many bioinformatics software tools routinely used in sequence analysis are open source and freely available, their installation, operation, and maintenance can be cumbersome and require significant technical expertise [3, 7], motivating efforts to pre-package and bundle bioinformatics tools. In addition, individual tools are often insufficient for sequence analysis and instead need to be integrated with others into multi-step pipelines for thorough analysis. To aid with this, bioinformatics workflow systems and workbenches, such as Galaxy, Ergatis, GenePattern, and Taverna, provide user interfaces that simplify the execution of tools and pipelines on centralized servers. Prior to analysis, researchers utilizing genomics approaches are faced with a multitude of choices of analysis protocols, and best practices are often poorly documented. The complexity of analysis pipelines and the lack of transparent protocols can limit the reproducibility of computed results. Use of workbenches that store pipeline metadata and track data provenance can improve reproducibility.
Bioinformatics service providers, such as RAST, MG-RAST, ISGA, and the IGS Annotation Engine, have attempted to address challenges in microbial genome analysis by providing centralized services, where users submit sequence data to a web site for analysis using standardized pipelines. In this model, the service provider operates the online resource, dedicating the necessary personnel and computational resources to support a community of users. Bioinformatics workflow systems [8–11] also operate on central servers, utilizing dedicated or shared network-based storage and clusters of computers for improved processing throughput. Other efforts have bundled tools into portable software packages for installation on a local computer, including Mothur and Qiime for 16S ribosomal RNA analysis and DIYA for bacterial genome annotation. A virtual machine (VM) encapsulates an operating system with pre-installed and pre-configured software in a single executable file that can be distributed and run elsewhere. VMs thereby eliminate complex software installations and adaptations and enable portable execution, directly addressing one of the challenges involved in using bioinformatics tools and pipelines.
Cloud computing offers leasable computational resources on demand over a network. The cloud computing model can simplify access to a variety of computing architectures, including large-memory machines, while eliminating the need to build or administer a local computer network, addressing challenges in the access to and deployment of infrastructure for bioinformatics [6, 21]. Cloud computing platforms have been emerging in the commercial sector, including the Amazon Elastic Compute Cloud (EC2), and in the public sector to support research, such as Magellan and DIAG. In combination with virtual machines, cloud computing can help improve accessibility to complex bioinformatics workflows in a reproducible fashion on a readily accessible distributed computing platform.
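To illustrate the on-demand model (this is a generic sketch, not part of CloVR), the following Python fragment uses the boto3 library to lease a single EC2 instance from a pre-built machine image and release it afterwards; the image ID, instance type, and key pair name are placeholders:

```python
# Minimal sketch of on-demand provisioning on Amazon EC2 using boto3.
# The AMI ID, instance type, and key pair below are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Lease a single instance from a pre-built machine image (hypothetical ID).
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder image ID
    InstanceType="m5.large",           # example instance type
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",              # placeholder SSH key pair
)
instance = instances[0]
instance.wait_until_running()
instance.reload()
print("Instance running at", instance.public_dns_name)

# ... run analyses on the leased machine, then release the resource.
instance.terminate()
```

The same lease-and-release pattern applies, with different APIs, to other commercial or academic cloud platforms.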
There is considerable enthusiasm in the bioinformatics community for the use of cloud computing in sequence analysis [6, 21, 25, 26]. While cloud computing platforms provide ready, on-demand access to computing resources over the Internet and can improve processing throughput, utilizing bioinformatics tools and pipelines on such distributed systems requires technical expertise to achieve robust operation and the intended performance gains [6, 27]. Cluster management software, workflow systems, or databases may need to be installed, maintained, and executed across multiple machines. In addition, challenges in data storage and transfer over the network add to the complexity of using cloud computing systems.
Map-Reduce algorithms using the cloud-ready framework Hadoop are available for sequence alignment and short-read mapping, SNP identification, and RNA expression analysis, among others, demonstrating the usability of cloud services to support large-scale sequence processing. Despite the emergence of methods built on such cloud-ready frameworks, many bioinformatics tools, analysis pipelines, and standardized methods are not readily transferable to these frameworks but can be trivially parallelized using batch processing systems [8, 9].
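As a minimal illustration of the Map-Reduce model (not a reproduction of any of the cited tools), the following Hadoop Streaming style mapper and reducer, written in Python, tally sequencing reads by length from a FASTQ stream; the script name and job invocation in the comments are hypothetical:

```python
# Sketch of a Hadoop Streaming style Map-Reduce job (illustrative only):
# the mapper emits (read_length, 1) pairs for every FASTQ record and the
# reducer sums the counts per length. It could be run, for example, as
#   hadoop jar hadoop-streaming.jar -mapper "job.py map" -reducer "job.py reduce" ...
import sys

def mapper():
    # FASTQ records span four lines; the sequence is the second line.
    for i, line in enumerate(sys.stdin):
        if i % 4 == 1:
            print(f"{len(line.strip())}\t1")

def reducer():
    counts = {}
    for line in sys.stdin:
        length, count = line.strip().split("\t")
        counts[length] = counts.get(length, 0) + int(count)
    for length, count in sorted(counts.items(), key=lambda kv: int(kv[0])):
        print(f"{length}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Such embarrassingly parallel counting tasks map naturally onto Hadoop; many multi-step annotation or assembly pipelines do not, which is why batch processing systems remain the more direct route to parallelization.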
In this paper, we describe a new application, Cloud Virtual Resource (CloVR), that relies on two enabling technologies, virtual machines (VMs) and compute clouds, to provide improved access to bioinformatics workflows and distributed computing resources. CloVR provides a single VM containing pre-configured and automated pipelines, suitable for easy installation on the desktop, with cloud support for increased analysis throughput.
In building the CloVR VM, we have addressed the following technical challenges in using cloud computing platforms:
i) Elasticity and ease of use: clouds can be difficult to adopt and use, requiring operating system configuration and monitoring; many existing tools and pipelines are not designed for dynamic environments and require re-implementation to utilize cloud-ready frameworks such as Hadoop;
ii) Limited network bandwidth: Internet data transfers and relatively slow peer-to-peer networking bandwidth in some cloud configurations can cause performance and scalability problems; and
iii) Portability: reliance on proprietary cloud features, including specialized storage systems, can hinder pipeline portability; also, virtual machines, while portable and able to encapsulate complex software pipelines, are themselves difficult to build, configure, and maintain across cloud platforms.
The architecture of CloVR addresses these challenges by
i) simplifying use of cloud computing platforms by automatically provisioning resources during pipeline execution (see the sketch following this list);
ii) using local disk for storage and avoiding reliance on network file systems;
iii) providing a portable machine image that executes both on a personal computer and on multiple cloud computing platforms.
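The following is a generic sketch of automatic (elastic) provisioning, included only to illustrate the concept rather than the CloVR implementation: a controller polls a task queue and leases or releases worker machines so that capacity tracks the pipeline's current demand. The queue, provision, and release interfaces are placeholders for whatever local or cloud API is actually used.

```python
# Generic sketch of elastic provisioning (not the CloVR implementation):
# keep the number of leased worker machines proportional to pending tasks.
import time

TASKS_PER_WORKER = 10   # assumed target load per worker
MAX_WORKERS = 20        # assumed upper bound on leased machines

def autoscale(queue, provision, release, poll_seconds=60):
    """Size the worker pool to the pending task count until the queue drains."""
    workers = []
    while True:
        pending = queue.size()
        wanted = min(MAX_WORKERS, -(-pending // TASKS_PER_WORKER))  # ceiling division
        while len(workers) < wanted:
            workers.append(provision())      # lease one more machine
        while len(workers) > wanted:
            release(workers.pop())           # return an idle machine
        if pending == 0 and not workers:
            break                            # pipeline finished, all machines released
        time.sleep(poll_seconds)
```

Handling this loop inside the VM spares the user from manually starting, monitoring, and shutting down cloud instances.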
In the presented work, we evaluate the following features of the CloVR architecture: portability across different local operating systems and remote cloud computing platforms, support for elastic provisioning of local and cloud resources, scalability of the architecture, use of local data storage on the cloud, and the build process of the CloVR VM from a base image using recipes.