QMachine: commodity supercomputing in web browsers
© Wilkinson and Almeida; licensee BioMed Central Ltd. 2014
Received: 19 February 2014
Accepted: 27 May 2014
Published: 9 June 2014
Ongoing advancements in cloud computing provide novel opportunities in scientific computing, especially for distributed workflows. Modern web browsers can now be used as high-performance workstations for querying, processing, and visualizing genomics’ “Big Data” from sources like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) without local software installation or configuration. The design of QMachine (QM) was driven by the opportunity to use this pervasive computing model in the context of the Web of Linked Data in Biomedicine.
QM is an open-sourced, publicly available web service that acts as a messaging system for posting tasks and retrieving results over HTTP. The illustrative application described here distributes the analyses of 20 Streptococcus pneumoniae genomes for shared suffixes. Because all analytical and data retrieval tasks are executed by volunteer machines, few server resources are required. Any modern web browser can submit those tasks and/or volunteer to execute them without installing any extra plugins or programs. A client library provides high-level distribution templates including MapReduce. This stark departure from the current reliance on expensive server hardware running “download and install” software has already gathered substantial community interest, as QM received more than 2.2 million API calls from 87 countries in 12 months.
QM was found adequate to deliver the sort of scalable bioinformatics solutions that computation- and data-intensive workflows require. Paradoxically, the sandboxed execution of code by web browsers was also found to enable them, as compute nodes, to address critical privacy concerns that characterize biomedical environments.
A supporting QM deployment is available at https://v1.qmachine.org, and its source code is available at https://github.com/wilkinson/qmachine. The illustrative examples and their dependencies are provided for live demonstration at http://q.cgr.googlecode.com/hg/index.html along with a screencast and archived genomic data.
High-performance computing (HPC) for the life sciences is undergoing a fundamental reshaping. The reliance on processor-intensive resources through which ever-enlarging genomics workflows are funneled is giving way to distributed data-intensive infrastructures like TCGA and ICGC. Accordingly, the immovable volumes that are flooding data centers demand “beyond the data deluge” solutions that invert the traditional transfer model so that computations travel to the data and not vice versa. The emphasis, then, is to maximize the availability of the data and the portability of the application. The increasing use of cloud computing infrastructure for biomedical applications reflects the realignment of HPC, as exemplified by the recent partnership between the National Institutes of Health (NIH) and Amazon on the 1000 Genomes Project.
At the same time, HPC projects such as SETI@home, Folding@Home, and BOINC have constructed distributed platforms that aggregate commodity hardware and volunteer compute cycles in order to power computationally intensive scientific workflows. In fact, the Folding@Home project currently utilizes the central and/or graphics processing units from more than 250,000 personal computers and video game consoles. To orchestrate concentrated efforts across such large numbers of physical machines and hardware platforms, researchers provide client applications that they must persuade volunteers to download and install permanently on their machines. These applications range in invasiveness from programs that run only when a machine is idle, such as the pioneering SETI@home, to always-on background services like Condor that may tangibly impact a machine’s performance.
The World Wide Web provides a different avenue for HPC, and QM explores that direction. The temptation to optimize QM for a particular problem domain was overcome by the greater challenge of creating a system not only to distribute computation across the Web, but also to be “of the Web” itself. A careful study of the Web as a platform reveals that the necessary components are indeed ready for assembly.
Web browsers execute JS in sandboxed environments that rigorously control access to machine resources, and now those sandboxes implement standardized APIs that provide native capabilities like hardware-accelerated 3D graphics. All modern browsers and even a few browser plugins include just-in-time (JIT) compilers to boost performance. Regular expressions in JS, for example, perform at levels that are no longer matched by Perl, the language most often associated with string processing in bioinformatics applications. Moreover, these high-performance JS environments come pre-installed on every personal computer sold today, as well as on smartphones, tablets, gaming consoles, and even televisions. Thus, web browsers represent a modern route for high-performance computing that is well-suited for the “crowdsourcing” model. Indeed, the current fast proliferation of bioinformatics libraries in JS also reflects the advent of web-based “social coding” environments which present entirely novel opportunities for large-scale collaboration. Furthermore, the networking capabilities of the browser platform allow it to import code and data dynamically and thereby orchestrate distributed workflows across multiple browsers on distinct machines, a feature at the core of social computing. Therefore, what is described in this report could be construed as social computing for machines, extending the reach of loose distribution models such as mGrid.
The emergence of Big Data in the biomedical sciences has been associated with the proliferation of reference databases such as those reviewed yearly by Nucleic Acids Research. The aggregation of the Web of Linked Data resources independent of the institutions that host them has been approached by comprehensive data models such as the Distributed Annotation System, which we have also explored as a backbone for workflow assembly. It is now amply clear, however, that the linking of data resources, regardless of the domain, is itself domain-neutral and best described by dyadic predicates of W3C’s Resource Description Framework (RDF) that underlies the third generation of Web technologies [21, 26, 27].
The RDF approach has expanded the basic reliance on uniform resource identifiers (URIs) both to identify and locate data (via URLs), which require only a web browser to be put to good use by any researcher, regardless of expertise or domain of interest. The current extent of its use is dramatically illustrated by the adoption of the RDF framework across all data services of the European Bioinformatics Institute. As also illustrated by some of our work [29–31] developing SPARQL endpoints for TCGA, the volume of the server-side hosted data is not a significant obstacle to developing web applications (“web apps”) that consume those data. On the other hand, mechanisms to assemble workflows for data analysis have not yet matured as user-friendly commodities, despite the availability of excellent frameworks like Taverna and SHARE. One possibility is that the underlying web services themselves need to be amenable to assembly at a moment’s notice – even for deprecated or outdated versions of a procedure. This is an absolute requirement of the modern focus on reproducibility of workflow results. We have explored the use of modular browser-based web apps to deliver this functionality in standard bioinformatics applications such as image analysis and sequence analysis. The success in those two efforts strengthens the claim that script tag loading, the same mechanism web browsers use to load web apps, can orchestrate and distribute the execution of bioinformatics workflows across multiple physical machines. The illustrative, and validating, example detailed in the Results section below will extend the same example of sequence analysis approached in the second of those two reports by analyzing twenty different genomes of Streptococcus pneumoniae in parallel.
To provide this functionality, QM contains three main components: an API server, a web server, and a website. The API and web servers are written completely in JS, and the website is written in HTML5, CSS, and JS. Nothing about QM’s design or interface binds it to a particular development stack, but our desire to construct the project as a true Web Computing “device” motivated us to implement as much of the code in JS as possible. The strategy paid unexpected dividends, as well; the server-side components are free from assumptions about the hardware and operating systems on which they run, which vastly simplifies deployment to the cloud via Platform-as-a-Service (PaaS).
The API server is implemented as a simple Node.js program that loads and executes all of its application logic from QM’s own publicly available module, “qm”, using the Node Package Manager (NPM). The module supports five different databases as targets for persistent storage: Apache CouchDB, MongoDB, PostgreSQL, Redis, and SQLite. These five open-source databases were chosen for support based on their high performance and popularity, and their differences in design help to guide the development of QM as an HPC solution for a heterogeneous database landscape. The relative merits of the alternative implementations to the default use of MongoDB are as follows. CouchDB and MongoDB are both document-centric NoSQL databases with MapReduce APIs that understand JS, but their designs are very different. CouchDB is more than just a database – it is nearly sufficient to implement QM by itself because it bundles a web server and an HTTP-accessible API. MongoDB, by way of contrast, has an API that mimics the traditional relational style used by PostgreSQL and SQLite, with a much stronger focus on clustering and “sharding” (horizontal partitioning) across nodes. PostgreSQL represents relational database management systems (RDBMS), the workhorses that traditionally power enterprise applications and data warehouses, while SQLite represents embedded (serverless) databases. Redis is an in-memory key-value store that is often referred to as a “data structure server” because its keys can contain strings, hashes, lists, sets, and sorted sets. The ability to map QM’s persistent representation layer across such a wide variety of storage systems simplifies deployment and maintenance significantly. The service that backs this report’s illustrative examples, available at https://v1.qmachine.org, uses MongoDB.
QM’s API server supports Cross-Origin Resource Sharing (CORS) so that any webpage can embed QM to distribute workflows across web browsers without violating the Same-Origin Policy. There is wide support for CORS in web browsers.
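As a minimal sketch of what CORS support involves on the server side, the handler below attaches the standard `Access-Control-*` response headers and answers preflight `OPTIONS` requests. It is written for Node's built-in `http` module; the function name `handleRequest`, the permissive `*` origin, and the JSON body are assumptions made for illustration and are not taken from QM's source code.

```javascript
// Sketch of a CORS-enabled request handler for Node's built-in `http` module.
// The header values are illustrative and are not taken from QM's source.
function handleRequest(request, response) {
    const headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET, POST, OPTIONS',
        'Access-Control-Allow-Headers': 'Content-Type'
    };
    if (request.method === 'OPTIONS') {
        // Preflight request: reply with the CORS headers and no body.
        response.writeHead(204, headers);
        response.end();
        return;
    }
    headers['Content-Type'] = 'application/json';
    response.writeHead(200, headers);
    response.end(JSON.stringify({ ok: true }));
}

// To serve it: require('http').createServer(handleRequest).listen(8177);
```

With headers of this kind in place, a script served from any origin can call the API from the browser; a production deployment might restrict the allowed origin instead of using `*`.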
The web server, like the API server, is implemented as a Node.js program, and its logic is contained in the same NPM module, “qm”. That is, the installation of all of QMachine’s base libraries can be achieved simply by running Node’s built-in module management system: npm install qm. It is worth recalling the minimal role played by the server-side components of QM (see Figure 1 in Results). The web server exists only to provide the presentation/analytical layer’s resources to client machines. Because these resources are static, the web server can be replaced by off-the-shelf web servers like Apache and Nginx.
When a browser loads the webpage, it initially loads only the presentation layer, comprised of the HTML, CSS, and JS resources necessary to render the graphical user interface (GUI). Immediately after the GUI loads, the browser retrieves QM’s analytical layer, which is written entirely in JS. This design improves the user experience by loading the GUI faster, and it isolates the presentation layer’s code from the analytical layer’s code. Thus, third parties can embed QM’s analytical layer and thereby use QM’s persistent representation layer without loading QM’s presentation layer, as shown by the examples at https://v1.qmachine.org/barebones.html and http://q.cgr.googlecode.com/hg/index.html.
QM’s browser client models a workflow as a set of transforms that should be applied to input data in a specific order to produce output data. A “task description” is an object that contains the transform f, the data x, and any information needed to prepare the environment prior to execution.
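A task description of this kind can be sketched as a plain JS object. The field names `f` and `x` follow the text, while `env` as the name for the environment information, the helper `runTask`, and the toy transform are assumptions made for illustration; none of this is QM's actual client code.

```javascript
// Hypothetical illustration of a "task description"; `runTask` is not QM's API.
const task = {
    f: function (x) {
        // The transform: count the G and C characters in a sequence string.
        return x.split('').filter(function (ch) {
            return ch === 'G' || ch === 'C';
        }).length;
    },
    x: 'ATGCGCTA',  // the input data
    env: {}         // scripts to load before execution (none in this toy case)
};

function runTask(description) {
    // A real volunteer would first load everything listed in `description.env`.
    return description.f(description.x);
}

const gcCount = runTask(task);
```

Keeping the transform, the data, and the environment together in one object is what allows the whole description to be handed to another machine for execution.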
As described above, the client-side application that is distributed when a browser is pointed to https://v1.qmachine.org was developed using only web technologies: HTML5, JS, and CSS. In order to stay within the core JS syntax that is supported by all browsers and all platforms – including mobile devices – code development was assisted by JSLint. JSLint is also used directly within the analytical layer as a static analysis tool to identify tasks which can be serialized faithfully into JSON for distribution to volunteer machines. A generic library, Quanah, was also developed to solve the numerous concurrency challenges faced in asynchronous data transfer by QM; it is therefore a key component of the prototype described here and is accordingly also made publicly available as open source. The presentation layer uses jQuery and Twitter Bootstrap to ensure a consistent look-and-feel across a variety of mobile and desktop browsers. The GUI additionally attempts to support outdated browsers through optional integration with Google Chrome Frame, HTML5 Shiv, and json2.js, but it does so only as a courtesy.
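The reason a static check is needed before distribution can be seen in a simplified round trip. JSON has no function type, so a transform has to travel as source text and be rebuilt on the volunteer. The sketch below is a stand-in that does not use JSLint, and the names `serializeTask` and `deserializeTask` are hypothetical.

```javascript
// Simplified stand-in for the serialization step. QM itself relies on JSLint
// for static analysis, which this sketch does not attempt to reproduce.
function serializeTask(task) {
    // JSON has no function type, so the transform travels as source text.
    return JSON.stringify({ f: task.f.toString(), x: task.x });
}

function deserializeTask(json) {
    const raw = JSON.parse(json);
    // Rebuild the function from its source. Variables captured from an
    // enclosing scope are NOT restored, so only self-contained functions
    // survive the round trip faithfully.
    const f = new Function('return (' + raw.f + ');')();
    return { f: f, x: raw.x };
}

const original = { f: function (x) { return x.length; }, x: 'ACGT' };
const revived = deserializeTask(serializeTask(original));
const roundTripResult = revived.f(revived.x);
```

A transform that refers to variables outside its own body would deserialize without error but fail or misbehave when executed remotely, which is the kind of task a static analysis step is meant to flag in advance.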
The architecture of a QMachine detailed in Figure 1 follows the general pattern of Web 3.0 technologies by using the server side exclusively for persistent representation and leaving the rest of the program logic to run on the client side. QM uses a key-value architecture to orchestrate volunteering client machines in a manner that maximizes the distribution of the computational resources required for data transfer and subsequent data processing. This orchestration is highlighted in Figure 1: QM distributes not only the compute cycles needed to execute the n different procedures (Σ_1, Σ_2, …, Σ_n), but also the bandwidth needed to retrieve the corresponding input data (D_1, D_2, …, D_n) being processed from their respective URLs (d_1, d_2, …, d_n). This design is motivated by the constraints of biological applications such as next generation sequencing in which the limiting factor is more often the available memory than the processor speed.
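The orchestration pattern described here can be simulated in a few lines with an in-memory map standing in for the persistent key-value layer. Everything in this sketch is an assumption made for illustration: the record fields, the status names `waiting`, `running`, and `done`, the helper names, and the example URLs are not QM's actual schema or API.

```javascript
// In-memory stand-in for QM's persistent key-value layer. The record fields
// and status names are illustrative assumptions, not QM's actual schema.
const store = new Map();

function post(box, key, record) {
    store.set(box + '&' + key, Object.assign({}, record, { box: box, key: key }));
}

function get(box, key) {
    return store.get(box + '&' + key);
}

function keysWithStatus(box, status) {
    const keys = [];
    store.forEach(function (record) {
        if (record.box === box && record.status === status) {
            keys.push(record.key);
        }
    });
    return keys;
}

// Volunteer: claim one waiting task, fetch its input from the data's own URL,
// apply the named transform, and write the result back to the store.
function volunteerOnce(box, transforms, fetchData) {
    const waiting = keysWithStatus(box, 'waiting');
    if (waiting.length === 0) {
        return false;
    }
    const task = get(box, waiting[0]);
    post(box, task.key, Object.assign({}, task, { status: 'running' }));
    const data = fetchData(task.url);  // bandwidth is spent by the volunteer
    const result = transforms[task.f](data);
    post(box, task.key, Object.assign({}, task, { status: 'done', result: result }));
    return true;
}

// Submitter: post two tasks that name a transform and a (made-up) data URL.
post('demo-box', 'task1', { status: 'waiting', f: 'length', url: 'http://example.org/genome1.fasta' });
post('demo-box', 'task2', { status: 'waiting', f: 'length', url: 'http://example.org/genome2.fasta' });

// Stand-ins for the network and for the library of transforms.
const fakeWeb = {
    'http://example.org/genome1.fasta': 'ACGTAC',
    'http://example.org/genome2.fasta': 'GGC'
};
const transforms = { length: function (s) { return s.length; } };
function fetchFake(url) { return fakeWeb[url]; }

let rounds = 0;
while (volunteerOnce('demo-box', transforms, fetchFake)) {
    rounds += 1;
}
const results = keysWithStatus('demo-box', 'done').map(function (key) {
    return get('demo-box', key).result;
});
```

The store only ever holds small records (a transform name, a URL, a status, a result), while both the download of each input and its processing happen on the volunteer, which mirrors the division of labor the text attributes to Figure 1.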
Once loaded, the JS environment will contain a global object named QM with convenient high-level methods that can be used to reproduce the results of the four examples that follow.
As discussed in “Methods”, QM’s architecture does not impose the use of a specific programming language, as long as a compiler to JS, the web’s “assembler language”, is distributed with the remote call. To support this claim, the QM client library delegates to a compiler – written in JS – for the CoffeeScript language. For a list of compilers that can translate programs written in other languages into JS so that they can be interpreted by volunteering browsers, see http://bit.ly/altjsorg.
The fourth illustrative example assesses QM’s ability to scale the asynchronous operations demonstrated above for use in a real-world bioinformatics workflow. The example is a Fractal MapReduce decomposition of sequence alignment which distributes both the processing and networking loads across QM’s volunteers, as described in Figure 1. It also demonstrates that libraries of any complexity or elaboration can be distributed to the volunteers along with the commands that invoke those libraries. Specifically, both the data and the library encoding for the sequence analysis procedure are invoked by QM but entirely resolved and executed by the volunteer browsers. It also illustrates the ability of a volunteer node to call code and data from multiple sources which are independently developed and maintained.
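The general shape of a MapReduce decomposition over several genomes can be sketched locally and synchronously. This is not the Fractal MapReduce procedure of the paper: the sketch swaps in a much simpler analysis (substrings of length k shared by all inputs), and the function names and toy sequences are hypothetical. In QM, each map call would be executed by a volunteer rather than in the same process.

```javascript
// Local, synchronous stand-in: in QM each `map` call would run on a volunteer.
function kmers(sequence, k) {
    // Collect every distinct substring of length k.
    const found = new Set();
    for (let i = 0; i + k <= sequence.length; i += 1) {
        found.add(sequence.slice(i, i + k));
    }
    return found;
}

function intersect(a, b) {
    return new Set(Array.from(a).filter(function (item) {
        return b.has(item);
    }));
}

function mapReduce(inputs, mapper, reducer) {
    // Map each input independently, then fold the partial results together.
    return inputs.map(mapper).reduce(reducer);
}

const toyGenomes = ['ACGTAC', 'CGTACG', 'TTACGT'];
const shared = mapReduce(toyGenomes, function (s) { return kmers(s, 3); }, intersect);
const sharedList = Array.from(shared).sort();
```

Because each map call depends only on its own input, the calls can be assigned to different machines in any order, and only the small partial results need to be brought together for the reduce step.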
A full version of these examples can be found online at http://q.cgr.googlecode.com/hg/index.html. The version there includes the full URLs to all twenty Streptococcus pneumoniae genomes and also to the versioned libraries specified by env. An accompanying screencast for these examples is also provided on that page.
The server load associated with orchestrating this initial heavy use of QM is very modest because of the reliance on code distribution rather than on code execution. In fact, the deployment supporting the usage statistics described above (the server behind https://api.qmachine.org) was never overwhelmed by traffic spikes even though it was running on a shared-tenant virtual machine with just 512 MB of RAM, 2×512 MB MongoDB databases, and no hard drive. Furthermore, the authors do not incur any maintenance costs for the public tool dissemination, either from GitHub or from NPM’s package repository. We are therefore committed not to collect any data beyond the broad statistics described in Figures 3 and 4 for the reference deployment discussed here. Particularly relevant for the biomedical usage scenario that motivated this work, we are also committed not to collect any data at all from private deployments of QM; in other words, no part of QM’s software ever sends data back to our server(s) from other deployments. This design allows administrators to deploy their own QM servers through NPM and fully configure their own security as needed for clinical and/or biomedical research usage. These assurances can, of course, be verified through inspection of QM’s source code.
Many researchers with access to large-scale computational resources still find those resources inaccessible because “everyday” workflows often require more than just fast computers – they require programming skills that are harder to acquire. Bioinformatics workflows increasingly rely on MapReduce as an abstraction, but available MapReduce resources still expose researchers to programming environments with strict procedural requirements and steep learning curves. QM is much simpler to set up and operate than Apache Hadoop, for example. It allows users to run MapReduce jobs on multiple physical machines and to crowdsource elastic computing resources with the simplicity of writing and loading a webpage – skills performed every day by millions of people worldwide. We argue that using the web computing architecture explored by QM – that is, without installing a dedicated application – is a natural evolution of current cloud-based MapReduce services, just as Hadoop was a step up from one-off compile-and-run workflows.
QM’s web service provides a message passing interface for distributed computing. This statement may at first sound paradoxical, but JS’s single-threaded programming model does not limit JS programs to single-threaded execution; external execution contexts can be used to support concurrency via event-driven programming. QM leverages browsers’ asynchronous (non-blocking) network communications layers to connect multiple machines’ execution contexts, but browsers that support Web Workers can execute concurrent programs within the same physical machine.
An interesting new twist in the development of web computing architectures is the emergence of the “cloud browser”. In these systems, a mobile browser behaves as a thin client when a webpage’s scripts demand heavy computation. Cloud browsers therefore demonstrate browser scaling in the vertical direction, whereas QM demonstrates browser scaling in the horizontal direction. Because QM makes no assumptions about its volunteers’ underlying resources, cloud browsers can volunteer for QM alongside ordinary browsers without loss of generality. In other words, cloud browsers represent enhancements of present-day browsers, while QM presents a solution for HPC that advances the underlying architecture of the Web towards that of a Global Computer [64, 65].
In clinical environments, it can be difficult or even impossible to distribute workflows due to privacy concerns that prevent sensitive data from leaving the hospital environment, where conventional HPC is typically absent. QM addresses this concern without requiring additional resources. As shown in Figure 5, the median computational power of the Top500 HPC systems in November 2013 (http://goo.gl/XIUIDP) was roughly 2,600 times that of our lab’s standard-issue desktop machine. This is a much smaller factor than the number of machines in a typical medical center. Thus, even if restricted to a single hospital environment, volunteer computing can still rival the total capacity of very substantial HPC resources.
QM can also be used to power workflows inside of a single workstation. In such a scenario, the workstation would run QM’s API server locally and use multiple browser tabs to execute the workflow in parallel. Such a workflow might also incorporate existing bioinformatics tools such as the Basic Local Alignment Search Tool (BLAST) by using traditional server-side scripting languages like Perl or Python to connect to QM’s API or even directly to the persistent storage layer.
The security of workflows that use QM is handled orthogonally to QM by the selection of volunteers and by access control to code and data. A number of considerations should nevertheless be made to assist in the configuration of its distributed operation. It is important to recall that the web browser executes JS within a sandboxed environment, which, among other protections, prevents programmatic access to the filesystem of the volunteer machine. As a result, QM’s security is configured around two firewalls.
The first and most basic protection is associated with the uniqueness of the “box” (token) issued by the submitter, which should be shared only with trusted volunteers. An additional layer of security can be added through the use of open authentication such as OAuth 2.0 to verify that only trusted volunteers are involved. This second layer of protection is particularly useful in creating audit trails. These two mechanisms can be combined in many ways, as appropriate for a particular workflow. For example, different steps of a workflow could be assigned to distinct cohorts of volunteers depending on the sensitivity of the code and data and/or the trustworthiness of the volunteers. The resulting granularity could also be used to build redundancy – and therefore robustness – into the distributed QM operation.
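One way such redundancy might be used, offered here only as a hypothetical sketch, is to submit the same task to several independent volunteers and accept a result only when a strict majority of them agree. QM does not prescribe this scheme, and `majorityResult` is a name invented for the example.

```javascript
// Hypothetical redundancy check; QM does not prescribe this scheme.
function majorityResult(results) {
    // Tally results by their JSON text so that values compare by content.
    const counts = new Map();
    results.forEach(function (result) {
        const id = JSON.stringify(result);
        counts.set(id, (counts.get(id) || 0) + 1);
    });
    let best = null;
    counts.forEach(function (count, id) {
        if (count > results.length / 2) {
            best = id;
        }
    });
    // `undefined` signals that no strict majority agreed.
    return best === null ? undefined : JSON.parse(best);
}

const agreed = majorityResult([42, 42, 41]);
const disputed = majorityResult([1, 2, 3]);
```

A check of this kind protects the submitter against a faulty or dishonest volunteer at the cost of repeating the computation, so it would be most appropriate for the steps of a workflow assigned to less trusted cohorts.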
In short, the weakest link in QM’s architecture – and where the opportunities for abuse lie – derives from the sharing of the “box” by members of a group of volunteers. In this regard, the key feature of QM’s design is that the abuse can target the submitters but not the volunteers, because QM’s operations take place within the sandbox of the web browser.
QMachine was developed to respond to the challenges of – and to capitalize on the opportunities of – bioinformatics applications encountered in biomedical environments. For more than a decade, volunteer computing has enticed computational biology as a scalable and cost-effective high-performance computing solution. QM essentially ports that solution to the modern computational landscape, which is increasingly dominated by mobile hardware platforms and the use of the web browser as the universal software platform. The features of the modern web browser go beyond those that make it a high-performance computational environment with advanced communication layers; they also include the transformative feature that computations run in a robust sandbox that prevents access to the underlying machine’s potentially sensitive filesystem. QM also responds to another modern trend towards engaging HPC resources through the use of the MapReduce programming pattern, rather than through direct interactions with compute nodes. The sequence analysis application that illustrates the use of QM in this report offers the sort of immediate utility that would benefit bioinformatics applications in Medical Genomics. It is argued, however, that QM, as an “of the Web” distributed computing system, may be just as useful in the identification of the fundamental features of pervasive web computing.
The Streptococcus pneumoniae genome data are used directly from the publicly available online repository at http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/, and the relevant FASTA files have also been archived to http://q.cgr.googlecode.com/hg/data/, a version-controlled repository. The original data used to produce Figure 5 were taken from http://s.top500.org/static/lists/xml/TOP500_201311_all.xml and are archived to http://q.cgr.googlecode.com/hg/data/.
All source code for this paper is version-controlled and open-sourced. The primary source for QMachine’s code is located in a Git repository at https://github.com/wilkinson/qmachine. The code and data for the illustrative examples shown in the Results section are available in a Mercurial repository at http://q.cgr.googlecode.com/hg/. Quanah’s source code repository is available at https://github.com/wilkinson/quanah, and the USM repository is available at https://github.com/usm/usm.github.com.
This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1RR025777-03 from NIH National Center for Research Resources. This work was also supported in part by an NCI T32 Trainee Grant at Rice University under contract no. 5T32CA096520-05.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.