In this section, we describe the process of taking a desktop bioinformatics application, scaling it to tens of thousands of cores on supercomputers, and exposing it through a science gateway for easy access. We look at the complexity of the code and various profiling options. We also identify the computationally intensive and data-intensive sections of the code and examine how to parallelize it without changing the functionality of the application, as described in the following sections.
Profiling for parallelization
The unique design of petascale supercomputers, with performance exceeding one petaflop, tens of thousands of computing cores, low per-core system memory, and a reliance on networked distributed filesystems, makes their architecture different from that of clusters and desktop machines. Efficient use of these machines requires special operating systems and carefully designed and tuned applications. One possible way to improve the performance of tools that run on petascale computers is to change the underlying algorithms to better suit the characteristics of supercomputers; this approach involves the tedious process of recoding already complex and specialized tools. The alternative approach that we have taken is to profile the code in order to identify the computationally intensive and I/O-intensive functions.
We use CrayPat [25], PAPI [26], gprof [27], and our own timing code to analyze the runtime of the tools and, eventually, of our own optimizations as well. The CrayPat profiler was useful in identifying possible improvements to the data output scheme and to the overall MPI communication. A second profiler, gprof, was used to analyze the runtime of the tools themselves. Each application involves a great deal of source code: NCBI's BLAST code base is approximately 1.3 million lines, HMMER's roughly 35,000, and MUSCLE's nearly 28,000. Because of the extensive code base of these tools, as few changes as possible should be made to the code so as to avoid the need to maintain millions of lines of forked software.
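The hand-rolled timing code is conceptually simple. The following is a minimal sketch, in C, of the kind of wall-clock measurement used to bracket a suspect routine; the routine is an illustrative stand-in, not the actual wrapper or tool code.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Wall-clock timer of the kind used to bracket suspect routines alongside
 * the CrayPat and gprof measurements. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Stand-in for an I/O-heavy routine; purely illustrative. */
static void load_database(void)
{
    struct timespec pause = { 0, 250 * 1000 * 1000 };  /* pretend 0.25 s of disk reads */
    nanosleep(&pause, NULL);
}

int main(void)
{
    double t0 = now_seconds();
    load_database();
    double t1 = now_seconds();
    fprintf(stderr, "load_database took %.3f s\n", t1 - t0);
    return 0;
}
```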
After profiling and examining thousands of lines of code, we decided not to alter the functional part of any of these tools, so that we could avoid the time investment necessary to understand and parallelize each tool for advanced architectures. Instead, we created a software wrapper that requires only a very small number of changes to each tool's original source code. The architecture of the wrapper solution is shown in Figure 1. The wrapper is an executable that runs outside of BLAST, HMMER, and MUSCLE, but serves to handle the I/O of these tools in an efficient manner, leaving the core of the application as an untouched "black box". The main difficulty in scaling these tools is management of I/O access patterns. The wrapper uses shared memory to redirect the tool's I/O to the wrapper process, so that it may handle the I/O more efficiently. We chose shared memory segments as the means of inter-process communication because of their low overhead, the ease with which pseudo file I/O can be implemented, and the optimization opportunities they provide, as described later.

Many supercomputers use a distributed filesystem, such as the Lustre distributed filesystem [28], for I/O data storage. Each instance of the tool needs to load a copy of the database, but the file system does not perform well when thousands of processes try to read the same files simultaneously. A better design is to load all of the needed data with one reader and then broadcast it to all of the worker nodes, ensuring that the data is read from disk only once. In our solution, we load data from the file system to the master node and use MPI's MPI_Bcast function to broadcast data from the master node to all worker nodes that require the input databases and configuration files. This broadcast is more efficient, scaling logarithmically in the number of nodes. Additionally, the use of shared memory allows a single image of the database to be shared by multiple processing cores. The worker nodes prefetch query sequences from the master node in order to hide input latency while achieving dynamic load balancing, which is essential due to the variance in query runtimes [16].
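The read-once, broadcast-many pattern can be illustrated with a short MPI sketch. The code below is a simplified illustration of the design described above, not the actual wrapper: the database file name is hypothetical, error handling is minimal, and a real implementation would broadcast the database in chunks and place it in a shared-memory segment for the cores on each node.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long db_size = 0;
    char *db = NULL;

    if (rank == 0) {
        /* Only the master rank touches the distributed file system. */
        FILE *fp = fopen("blastdb.bin", "rb");   /* hypothetical database file */
        if (fp == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        fseek(fp, 0, SEEK_END);
        db_size = ftell(fp);
        rewind(fp);
        db = malloc(db_size);
        if (fread(db, 1, db_size, fp) != (size_t)db_size)
            MPI_Abort(MPI_COMM_WORLD, 1);
        fclose(fp);
    }

    /* Broadcast the size first, then the database itself; the collective
     * scales logarithmically in the number of ranks, so the file system
     * sees exactly one reader regardless of how many workers participate. */
    MPI_Bcast(&db_size, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        db = malloc(db_size);
    MPI_Bcast(db, (int)db_size, MPI_BYTE, 0, MPI_COMM_WORLD);  /* real code chunks >2 GB */

    /* ... each node would now expose `db` to the unmodified tool through a
     * shared-memory segment and prefetch query batches from the master ... */

    free(db);
    MPI_Finalize();
    return 0;
}
```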
Optimization for improved data access
Another common issue related to file access patterns is the reliance on sequential loading of files. In this model, a file is memory-mapped into the process's virtual address space and then read sequentially. A page fault occurs each time the program attempts to read data that has not yet been loaded from the filesystem, at which point the next page of data is loaded into memory. This results in many small requests for data from the networked file system, which is slower than loading larger blocks of data. Our solution to this problem is to preload the database into memory with a single system call. The database typically consumes the most memory during a run and may be larger than the region addressable through the system's translation look-aside buffer (TLB) when standard-size pages are used. Therefore, we increase the page size of the preloaded region so that it requires fewer entries in the TLB, which decreases the number of TLB cache misses. We found that using 2 MB pages instead of the standard 4 KB pages can reduce the processing time of the BLAST tool by 15% to 60%, depending on the parameters used [16].
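As an illustration, the following C sketch preloads a database file into an anonymous mapping and asks the kernel to back it with 2 MB huge pages. It is a minimal sketch under the assumptions stated in its comments (a Linux system with transparent huge pages enabled); the exact mechanism used by our wrapper may differ, and the command-line usage is illustrative.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <dbfile>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Anonymous mapping; MADV_HUGEPAGE asks the kernel to back it with
     * 2 MB pages, so a multi-GB database needs far fewer TLB entries. */
    char *buf = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    madvise(buf, st.st_size, MADV_HUGEPAGE);

    /* Bulk read instead of demand-paging one 4 KB page at a time. */
    ssize_t total = 0;
    while (total < st.st_size) {
        ssize_t n = read(fd, buf + total, st.st_size - total);
        if (n <= 0) { perror("read"); return 1; }
        total += n;
    }
    close(fd);

    printf("preloaded %ld bytes into a huge-page-backed region\n", (long)st.st_size);
    munmap(buf, st.st_size);
    return 0;
}
```

On systems configured with explicit huge pages, MAP_HUGETLB or a hugetlbfs mount is an alternative to the MADV_HUGEPAGE hint used here.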
The performance of output operations was also improved in several ways. Originally, data would be written to disk as soon as it became available, resulting in many individual writes to the file system and degrading performance. Our improvement is a two-stage buffering technique in which the tools write output to an in-memory buffer instead of directly to disk. The data is flushed from the in-memory buffers to disk by a background process when the buffers are nearly full, rather than on demand. This increases the output bandwidth and results in more uniform output times; writing to disk in the background also reduces blocking in the tool itself. Additionally, we distribute the output files across multiple directories to better exploit architectures with a distributed file system: this optimization increases the likelihood that the file system will place the data on multiple hardware devices, resulting in higher bandwidth for parallel I/O. In addition to writing data in blocks, we provide the option of compressing the output blocks. Allowing output to be transferred and stored in compressed form helps to increase the throughput of output records, and compression occurs in the background so that no additional latency is introduced when the tool writes data. These output enhancements increased output bandwidth by 2809% [16]. All of the above optimizations are performed by a single wrapper [16], shown in Figure 1.
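The buffering scheme can be sketched as follows. This is a deliberately simplified illustration: it uses a background thread rather than a separate process, omits compression and the striping of output files across directories, and all names (buffer size, output file, record format) are hypothetical rather than taken from the actual wrapper.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE (1 << 20)              /* 1 MB per buffer, for the sketch */

static char   bufs[2][BUF_SIZE];
static size_t fill;                     /* bytes in the buffer being filled */
static int    active;                   /* index of the buffer being filled */
static int    flush_idx = -1;           /* buffer handed to the flusher, or -1 */
static size_t flush_len;
static int    done;
static pthread_mutex_t lock       = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  work_done  = PTHREAD_COND_INITIALIZER;

/* Background flusher: performs the large sequential write so the tool
 * itself never blocks on the file system. */
static void *flusher(void *arg)
{
    FILE *out = arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (flush_idx < 0 && !done)
            pthread_cond_wait(&work_ready, &lock);
        if (flush_idx < 0) { pthread_mutex_unlock(&lock); return NULL; }
        int idx = flush_idx;
        size_t len = flush_len;
        pthread_mutex_unlock(&lock);

        fwrite(bufs[idx], 1, len, out);  /* one large write per buffer */

        pthread_mutex_lock(&lock);
        flush_idx = -1;                  /* buffer is free for reuse */
        pthread_cond_signal(&work_done);
        pthread_mutex_unlock(&lock);
    }
}

/* Hand the active buffer to the flusher; caller holds the lock. */
static void hand_off(void)
{
    while (flush_idx >= 0)               /* wait for any flush in progress */
        pthread_cond_wait(&work_done, &lock);
    flush_idx = active;
    flush_len = fill;
    active ^= 1;
    fill = 0;
    pthread_cond_signal(&work_ready);
}

/* Called in place of a direct write to disk. */
static void buffered_write(const char *data, size_t len)
{
    pthread_mutex_lock(&lock);
    if (fill + len > BUF_SIZE)           /* buffer nearly full: swap and flush */
        hand_off();
    memcpy(bufs[active] + fill, data, len);
    fill += len;
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    FILE *out = fopen("results.out", "w");   /* hypothetical output file */
    pthread_t t;
    pthread_create(&t, NULL, flusher, out);

    for (int i = 0; i < 200000; i++)
        buffered_write("query\thit\tscore\n", 16);

    pthread_mutex_lock(&lock);
    if (fill > 0) hand_off();                /* flush the final partial buffer */
    done = 1;
    pthread_cond_signal(&work_ready);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    fclose(out);
    return 0;
}
```

The key property is that the tool's write path only touches memory; the large sequential writes happen off the critical path.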
Output analysis tools
We have also developed large-scale output data analysis tools, because the data generated by these massive runs of query sequences can range from gigabytes to terabytes, and accessing and analyzing such large datasets can be prohibitive. We have written parsers that convert the XML [29] and other output formats into tab-delimited, user-friendly outputs, and that generate SQL databases for easy retrieval of results. Users have access to their data either in the raw formats produced by the tools or, if desired, in the parsed formats. All results are then made accessible through science gateways. Figure 2 illustrates this approach: the user interacts with the science gateway portal and initiates a job that is run through parallel modules on an HPC resource. Based on the user's needs, either the raw results are delivered back to the user, or the output is parsed and prepared for easier use before it is made accessible through the gateway.
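As a small illustration of this kind of post-processing, the sketch below scans NCBI BLAST XML output (assuming the usual pretty-printed form with one element per line) and emits a tab-delimited table of hit descriptions and E-values. It is a toy example, not the actual PoPLAR parser; a production parser uses a proper XML library, handles the full schema, and also loads results into SQL databases.

```c
#include <stdio.h>
#include <string.h>

/* Extract the text between <tag>...</tag> on a single line, if present. */
static int extract(const char *line, const char *tag, char *out, size_t outlen)
{
    char open[64];
    snprintf(open, sizeof open, "<%s>", tag);
    const char *start = strstr(line, open);
    if (!start) return 0;
    start += strlen(open);
    const char *end = strchr(start, '<');
    if (!end) return 0;
    size_t n = (size_t)(end - start);
    if (n >= outlen) n = outlen - 1;
    memcpy(out, start, n);
    out[n] = '\0';
    return 1;
}

int main(void)
{
    char line[4096], hit[1024] = "", evalue[64];
    printf("hit_description\tevalue\n");
    while (fgets(line, sizeof line, stdin)) {
        if (extract(line, "Hit_def", hit, sizeof hit))
            continue;                        /* remember the current hit */
        if (extract(line, "Hsp_evalue", evalue, sizeof evalue))
            printf("%s\t%s\n", hit, evalue); /* one row per HSP */
    }
    return 0;
}
```

Such a filter would be run as, for example, `./blast_xml2tab < results.xml > results.tsv` (the program name is hypothetical).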
Science gateway: challenges and solutions
The usefulness of science gateways, which provide the ability to submit jobs to HPC resources remotely via a defined service, is well established [22]. By allowing scientists, researchers, and students to use HPC resources in this way, we can provide a user interface that is highly customized to the user's purpose. This saves time and effort on the researcher's part, and by regulating and standardizing the input interface, we use computing resources more efficiently, streamlining complex workflows and lowering the barriers posed by gaps in expertise and experience with computational resources.
In the process of creating the PoPLAR science gateway, we were faced with several design decisions that directly affected the performance and functionality of our end product. To provide a single experience accessible to the broadest audience, we have implemented PoPLAR as a web-based portal, rather than an alternative solution such as a thick client or desktop application. This also allows us to better centralize management and data flow, and to enforce requirements and best practices such as those for XSEDE resources.
The initial challenge when establishing a portal is the selection of an architecture that can fulfill the requirements of a science gateway and the needs of the project and its users. Most of the required components are available as out-of-the-box solutions. While many options exist, we identified three that most closely met our needs: the HUBzero [30], Galaxy [29], and CIPRES [31] platforms. HUBzero and Galaxy were the first two platforms we examined. Both were attractive front-end solutions because they are feature rich, with built-in capabilities that fulfill many of the XSEDE gateway best practices (e.g., user registry with permissions, logging of users and usage, job monitoring, statistics tracking, system logging), and both have powerful administrative interfaces. HUBzero is a general science collaboration environment strongly focused on allowing users to contribute tools, whereas Galaxy is more focused on the biological sciences and has strong workflow construction functionality. Both offer some form of job submission to remote computational resources. From our perspective, the strength of HUBzero and Galaxy was also their weakness: extra development time would have been necessary to adapt them to our purposes, and their generality could detract from our more focused research. For both, the time investment required to adapt and support the packages was our primary concern.
The third solution we examined and chose is an adaptation of the CIPRES Science Gateway (CSG) platform [31]. The CSG platform combines powerful, built-in capabilities and a focus on computational biology applications in such a way that it meets most of our feature requirements without over-scoping and requiring extensive customization. Some of the most attractive aspects of CSG included (a) a focus on HPC applications, specifically adapted to both XSEDE and academic resources; (b) a scalable architecture designed for fully exposing multiple applications (such as our Highly Scalable Parallel [HSP] tool suites) to users via an easy-to-use graphical interface; and (c) total customizability and parameterization of these tools.
CSG uses the Workbench Framework, a "generic distributable platform to access and search remote databases and deploy jobs on remote computational resources" [31]. The Workbench Framework provides a scalable mechanism of XML descriptions that map to graphical user interfaces (GUIs), and it furnishes a schema for constructing command line statements from the user input entered into those GUIs [21]. This approach offers scalability through ease of development; a robust mechanism for specifying, capturing, and error-checking user parameters; and a separation of presentation from content that allows each to be manipulated and developed independently.
For these reasons, and because CSG so closely matches our application needs, we selected it, which reduced initial implementation time and allowed us to focus on customizing and extending the framework.
One of the biggest challenges we aim to address is the transfer of data. Our system allows multiple jobs to run in parallel on very large bioinformatics data sets. When these tools are used locally on an HPC resource, the resulting output is stored on its file system. The design of this project requires delivering that output to the end user, who has submitted the job via a web interface that is not physically co-located with the compute machine's file system. Scaling therefore presents major challenges in handling large transfers and meeting local storage requirements. Synchronizing other job information with the user, such as job status, completion, and errors, is also an issue. We have implemented a system and run jobs of moderate size via PoPLAR, and we are continuing our efforts to improve the scalability of the system.
Science gateway: architecture overview
PoPLAR, our science gateway, uses the CSG framework [31]. It incorporates a Java Struts2 [32] based web portal running on Linux and Apache Tomcat, backed by a MySQL database, and employs Python job management scripts on the remote computational resources. Our implementation supports only registered users and restricts job submission to verified users with activated accounts. The web interface allows users to upload data sets and to create, configure, and submit jobs for specific tools, using their uploaded input data and tunable per-tool parameter settings. After job submission, the system populates the results into the portal and notifies the user of job completion. Figures 3, 4 and 5 show the login interface, an example user-configurable parameter setup for the HSP-BLAST tool, and the output of a successful job.
The system uses a community account for job submission and provides individual user registration, authorization, and authentication; detailed logging of portal and computation usage; submission of user attributes to compute resources; the ability to restrict or deny access to individual users; and system logging, among other features. Our adaptations after the initial implementation include restricting access to registered users, adding a verification process for account activation, enforcing country-of-citizenship restrictions on access to computational resources, adapting the remote job maintenance scripts to the National Institute for Computational Sciences supercomputer, and rebranding. We have incorporated several of our highly scalable parallel tools into the toolkit. We are working to extend the framework by incorporating integrated workflows across multiple tools, adding the parsing and analysis tools discussed above, and scaling data handling to hundreds of gigabytes of I/O.