PoPLAR: Portal for Petascale Lifescience Applications and Research
© Rekapalli et al.; licensee BioMed Central Ltd. 2013
Published: 28 June 2013
Skip to main content
© Rekapalli et al.; licensee BioMed Central Ltd. 2013
Published: 28 June 2013
We are focusing specifically on fast data analysis and retrieval in bioinformatics that will have a direct impact on the quality of human health and the environment. The exponential growth of data generated in biology research, from small atoms to big ecosystems, necessitates an increasingly large computational component to perform analyses. Novel DNA sequencing technologies and complementary high-throughput approaches--such as proteomics, genomics, metabolomics, and meta-genomics--drive data-intensive bioinformatics. While individual research centers or universities could once provide for these applications, this is no longer the case. Today, only specialized national centers can deliver the level of computing resources required to meet the challenges posed by rapid data growth and the resulting computational demand. Consequently, we are developing massively parallel applications to analyze the growing flood of biological data and contribute to the rapid discovery of novel knowledge.
The efforts of previous National Science Foundation (NSF) projects provided for the generation of parallel modules for widely used bioinformatics applications on the Kraken supercomputer. We have profiled and optimized the code of some of the scientific community's most widely used desktop and small-cluster-based applications, including BLAST from the National Center for Biotechnology Information (NCBI), HMMER, and MUSCLE; scaled them to tens of thousands of cores on high-performance computing (HPC) architectures; made them robust and portable to next-generation architectures; and incorporated these parallel applications in science gateways with a web-based portal.
This paper will discuss the various developmental stages, challenges, and solutions involved in taking bioinformatics applications from the desktop to petascale with a front-end portal for very-large-scale data analysis in the life sciences.
This research will help to bridge the gap between the rate of data generation and the speed at which scientists can study this data. The ability to rapidly analyze data at such a large scale is having a significant, direct impact on science achieved by collaborators who are currently using these tools on supercomputers.
The data generated by various scientific and non-scientific fields is growing exponentially with modern technologies. This paper focuses on the exponential growth of data in the life sciences, which has surpassed the rate at which both processing power and storage technologies are growing . In this work we focus on bioinformatics, a key area of information processing that has a direct impact on the quality of human life. Since the data in bioinformatics is growing beyond the scope of a single computing architecture, we are working on developing efficient, optimized, and highly scalable parallel applications on the latest and next-generation supercomputing architectures to meet the rising demand. The previously mentioned challenges relative to data growth and large-scale knowledge discovery could be addressed in three phases: the first phase is the development of parallel applications to analyze data at a rapid rate; the second is the creation of methods for easy access to these parallel applications; and the third is the development of databases to host analyzed data for quicker retrieval.
The software parallelizations that can be explored to address these gigantic problems are data parallelism, functional parallelism, and a combination of both data and functional parallelism. This paper discusses software wrappers used for data parallelization to process large-scale data in less time. These parallel wrappers aid us in parallelizing the most widely used applications in bioinformatics. The development of parallel applications is not the only aspect of the solution required for this data growth problem. After these applications have been developed, creating interfaces that will allow researchers with limited computing expertise to access these tools to solve their data analysis problems will also be essential. For this reason, we are constructing a science gateway with a web interface to expose these parallel applications for easy access. Finally, we are also developing a suite of tools to parse the outputs from all of the parallel applications and store the data in databases for faster retrieval and knowledge discovery since few of these tools generate more output data than input data by an order of magnitude or more.
Many parallel bioinformatics tools designed for large-scale data analysis exists. These implementations range from clusters to supercomputers and from grids to cloud computers. Wide ranges of workflows, gateways, and packages for various bioinformatics tools are also available, but implementations of highly scalable parallel bioinformatics applications with science gateways for large-scale data analysis are few to non-existent.
To build workflows needed by bioinformatics labs in the states of Tennessee and South Carolina as a part of an NSF proposal, we have identified the most widely used bioinformatics tools for sequence similarity searches, multiple sequence alignments, protein domain analysis, and phylogenetic tree construction. These tools include NCBI BLAST: Basic Local Alignment Search Tool ; HMMER  by HMMI Janelia Farm; MUSCLE: Multiple Sequence Comparison by Log-Exception ; ClustalW ; and MrBayes . Many parallel implementations of these tools focused for clusters exist, such as mpiBLAST , ScalaBLAST , mpiHMMER , ClustalW-MPI , and Parallel MrBayes . These tools achieve scalability by selecting techniques such as database fragmentation, parallel input and output (I/O), load balancing, and query prefetching, but only a few highly scalable implementations capable of using tens of thousands of cores on supercomputers are available, such as BLAST on Blue Gene/L implementation , pioBLAST , HSP-HMMER [14, 15], and HSPp-BLAST . By creating these highly scalable parallel tools and making them available to researchers through a science gateway, we are making a significant contribution to solving the pressing analysis needs in these areas.
Many research groups have active science gateway portals as part of the state-of-the-art Extreme Science and Engineering Discovery Environment (XSEDE) NSF program. These gateways focus on fields such as biochemistry [17, 18], biomedical computation , protein structure and interaction , and systemic and population biology . These portals offer multiple tools for their respective areas and run a subset of those tools on machines that are best suited for HPC computational resources. For example, of the dozens of phylogenetic tools available on CIPRES, versions of MrBayes, RAxML, GARLI, and BEAST use XSEDE resources , and Robetta is designed to run on computer clusters distributed as mirrors .
Many portals have workflow capabilities for bioinformatics tools, such as Galaxy . Some XSEDE informatics tools have available workflows, like ChemBioGrid , which has web-based computational workflows built on the Taverna  workflow tool. Currently, no portals exist that combine workflow between highly scalable bioinformatics tools on HPC resources with gateway access, along with tools for output data analysis. Our science gateway, the Portal for Petascale Lifescience Applications & Research (PoPLAR), will provide easy access to this unique combination of powerful, highly scalable parallel bioinformatics applications, output analysis tools, and knowledge-discovery resources. Our efforts to provide a new solution to this important problem are described in the section on development of parallel applications.
In this section, we focus on describing the process to take a desktop bioinformatics application, scale it to tens of thousands of cores on supercomputers, and expose it to a science gateway for easy access. We look at the complexity of the code and various profiling options. We also identify the computationally intensive and data intensive sections of the code, and examine parallelizing the code with no change in the functionality of the application as described in following sections.
The unique design of petascale supercomputers, with performance exceeding one petaflop, tens of thousands of computing cores, low per-core systems memory, and a reliance on networked distributed filesystems makes the architecture different from clusters and desktop machines. Efficient use of these machines requires special operating systems and carefully designed and tuned applications. One possible way to improve the performance of tools that run on petascale computers is by changing the actual algorithms to better suit the characteristics of supercomputers; this approach involves the tedious process of recoding already complex and specialized tools. The alternative approach that we have taken is to profile the code so as to identify the computationally intensive functions and I/O intensive functions.
We use CrayPat , PAPI , gprof , and our own timing code to analyze the runtime of the tools and eventually our own optimizations as well. The CrayPat profiler was useful in determining possible improvements to the data output scheme and improvements to the overall MPI communication. A second profiler, gprof, was used to analyze the runtime of the tools themselves. A great deal of source code is involved with each application; for example, NCBI's BLAST code base is approximately 1.3 million lines, HMMER at 35,000, and MUSCLE with nearly 28,000 lines. Because of the extensive code base of these tools, as few changes as possible should be made to the code so as to avoid the need to maintain millions of lines of forked software.
Another common issue related to file access patterns is the reliance on sequential loading of files. In this model, a file is memory-mapped into the process's virtual address space and then read sequentially. A page fault occurs each time the program attempts to read data that has not yet been loaded from the filesystem. Then this next page of data is loaded into memory. This results in many small requests for data from the networked file system, which is slower than loading larger blocks of data. Our solution to this problem is to preload the database into memory with a single system call. The database typically consumes the most memory during a run. The large size of the database could be larger than the area addressable by the system's translation look-aside buffer (TLB) when standard-size pages are used. Therefore, we increase the page-size of this preloaded region so that it requires fewer entries in the system's TLB, which decreases the number of TLB cache misses. We found that the use of 2 MB page sizes instead of the standard 4 KB can reduce the processing time of the BLAST tool by 15% to 60% depending on the parameters used .
The performance of output operations was also improved in several ways. Originally, data would be written to disk immediately when available, resulting in many individual writes to the file system and plaguing performance. Our improvement consists of a two-stage buffering technique. In this scheme, the tools write output to an in-memory buffer instead of directly to disk. The data is flushed from in-memory buffers to disk by a background process when the buffers are nearly full, rather than on demand. This increases the output bandwidth and also results in more uniform output time. Writing to disk in the background also reduces blocking in the tool itself. Additionally, we distribute the output files across multiple directories to optimize the common scenario of architectures with a distributed file system. This optimization works because it increases the likelihood that the file system will locate the data on multiple hardware devices, resulting in higher bandwidth for parallel I/O. In addition to writing data in blocks, we provide the option of compressing the actual output blocks. Allowing output to be transferred and stored in compressed form helps to increase the throughput of output records. Compression occurs in the background so that no additional latency is introduced when the tool writes data. These output enhancements have shown an increase in output bandwidth of 2809% . All the above optimizations are performed by a single wrapper , shown in Figure 1.
The usefulness of science gateways, which provide the ability to submit jobs to HPC resources remotely via a defined service, is well established . By allowing scientists, researchers, and students to use HPC resources in this way, we can provide a user interface that is highly customized to the user's purpose. This saves time and effort on the researcher's part, and by regulating and standardizing the input interface, we use computing resources more efficiently through the streamlining of complex workflows and the reduction of obstacles caused by deficiencies of expertise and experience with computational resources.
In the process of creating the PoPLAR science gateway, we were faced with several design decisions that directly affected the performance and functionality of our end product. To provide a single experience accessible to the broadest audience, we have implemented PoPLAR as a web-based portal, rather than an alternative solution such as a thick client or desktop application. This also allows us to better centralize management and data flow, and to enforce requirements and best practices such as those for XSEDE resources.
The initial challenge when establishing a portal is the selection of an architecture that can fulfill the requirements of a science gateway and the needs of the project and its users. The different components have mostly out-of-the-box solutions. While many possible options exist, we identified three that most closely met our needs: the HUBzero , Galaxy , and CIPRES  platforms. HUBzero and Galaxy were the first two platforms we examined. They were both attractive solutions for a front-end, because they are feature rich, with built-in capabilities that fulfill many of the XSEDE gateway best practices (e.g., user registry with permissions, logging of users and usage, job monitoring, statistics tracking, system logging), and have powerful administrative interfaces. HUBzero is a general science collaboration environment that is very focused on allowing users to contribute tools, whereas Galaxy is more focused on biological sciences and has strong workflow construction functionality. Both have some version of submitting jobs to remote computational resources. From our perspective, the strength of HUBzero and Galaxy was also their weakness in that extra development time would be necessary to make them usable for our purposes; and generalizability could detract from our more focused research. The time investment was our primary concern for both relative to adapting and supporting the packages.
The third solution we examined and chose is an adaptation of the CIPRES Science Gateway (CSG) platform . The CSG platform combines powerful, built-in capabilities and a focus on computational biology applications in such a way that it meets most of our feature requirements without over-scoping and requiring extensive customization. Some of the most attractive aspects of CSG included (a) a focus on HPC applications, specifically adapted to both XSEDE and academic resources; (b) a scalable architecture designed for fully exposing multiple applications (such as our Highly Scalable Parallel [HSP] tool suites) to users via an easy-to-use graphical interface; and (c) total customizability and parameterization of these tools.
CSG uses the Workbench Framework, which is a "generic distributable platform to access and search remote databases and deploy jobs on remote computational resources" . The Workbench Framework implementation provides a scalable mechanism of XML descriptions that map to graphical user interfaces (GUIs); and furnishes a schema for constructing command line statements with the user input entered into those GUIs . This approach offers scalability via ease of development, a robust mechanism for specifying, capturing, and error-checking user parameters, and abstracting presentation from content to allow for separate manipulation and development of both.
For those reasons and because CSG so closely matches our application needs, we selected it and thus were able to reduce initial implementation time and focus on customizing and extending the framework.
One of the biggest challenges we aim to address is the transfer of data. Our system allows for multiple runs of jobs--in parallel--on very large bioinformatics data sets. When using these tools locally on an HPC resource, the resulting output is stored to the file system. The design of this project requires the delivery of that output to the end user who has submitted the job via a web interface that is not physically co-located with the computation machine's file system. Therefore, scaling presents major challenges relative to handling large transfers and meeting local storage requirements. The synchronization of other job information with the user, such as job status, completion, errors, and so forth, also presents an issue. We have implemented a system and run jobs of moderate size via PoPLAR, and are continuing our efforts to improve the scalability of the system.
The system uses a community account for job submission and allows for individual user registration, authorization and authentication, detailed logging of portal and computation usage, submission of user attributes to compute resources, the ability to restrict or deny access to individual users, as well as system logging and other features. Our adaptations after implementation include restriction of access to registered users, the addition of a verification process for account activation, the enforcement of country of citizenship restrictions on access to computational resources, the adaptation of remote resource job maintenance scripts to the National Institute for Computational Sciences supercomputer, and changes in branding. We have incorporated several of our highly parallel scalable tools into the toolkit. We are working to extend the framework by incorporating integrated workflows across multiple tools, adding parsing and analysis tools as discussed above, and scaling data handling to hundreds of gigabytes for I/O.
In this scenario, the researcher has a newly generated set of genomes or sequences to be annotated for domain models that determine the function, structure, and evolution of the proteins. First, the HMMER tool hmmscan is used to identify domain models in those sequences and those without domain matches. Ones with domain matches are identified and documented. To find sequences similar to those without domain matches, the next step is to use the NCBI PSI-BLAST tool. If similar sequences are found using PSI-BLAST to contain known domains, the models for those domain matches can be updated; thus, the next time HMMER should be able to identify the proper domain matches. If PSI-BLAST does not find sequences with domain matches, the next step is to build novel domain models. Before building a model, the matching domains are aligned with using tool like MUSCLE, which generates a multiple sequence alignment (MSA). The MSA is then used to build a new domain model with hmmbuild, part of the HMMER package. That domain model is checked against the known nr database using the HMMER tool hmmsearch. If hmmsearch generates domain matches, the existing models can be improved. If not, the researcher has confirmed that the model created is new and can now be added to domain databases around the world. By making a system where the I/O of each tool are reconfigurable, all the bioinformatics tools, result-parsing tools, and data-conversion tools of the workflow could be used in various combinations to solve different problems in biology.
The data transfer in our current system takes place between the end user and the gateway and between the gateway and the compute resource. A typical sequence begins with the user uploading an input set to the gateway. We note here that once uploaded, an input set is saved indefinitely for reuse, so the sequence may also begin with using a previously uploaded data set. Then, after the user selects a tool and configures its parameters, the gateway copies the input data to the compute resource using Globus tools. After the job completes, the gateway copies the results back, and the end user can view or download the data.
To examine the scalability of data transfer, we conducted a series of tests to determine the current limits of our system. We tested a variety of I/O file sizes, from 100MB to 5GB. Our results showed that for input, we are currently limited by our data upload capability; the maximum input size accommodated by our system is approximately 500MB. For output, we tested output sizes incrementally up to 5GB, without finding a size limitation. In the future we plan to address the upload limitation through software and hardware changes. We are also investigating methods to improve data transfer speeds and to eliminate the need to copy output back to the science gateway for user access. Finally, we are also adapting the gateway to allow users to easily apply output data as input into another tool.
The PoPLAR gateway provides an easy-to-use graphical interface that is more efficient than using the command line, for example, in several ways. It is worth noting that because the tools being run via PoPLAR and via the command line are identical, there is absolutely no difference from a processing perspective in the efficiency between methods of the tools themselves. Rather, the benefits PoPLAR provides are through the ease-of-use of the graphical interface as well as several functional advantages, including:
The GUI replaces the command line interface, which frees the user from needing experience with the command line.
Likewise, knowledge of each specific tool's syntax is not necessary, as each tool's parameters are broken out and presented via a web form with help text.
The portal presents the user with one location for data (inputs and results), easy organization of data into folders, and a history of all jobs submitted.
Users can submit and move on to other tasks, as notification of a job's completion is sent to the user via email.
Different tools can be configured to run transparently on different computation resources, all made available to the user via the same interface.
The computation-resource-agnostic interface does not require the user to know anything about the specifics of different systems (e.g., job submission engines, distributed filesystems, or command shells); the same tool running on different resources can be configured to present the same interface.
Jobs submitted via the portal do not require the user to have an account on and log in to each computation resource (although the portal does allow users to charge their existing allocation)--only a browser is required to access the portal from anywhere.
By addressing the challenges described above and developing a web-portal science gateway for highly parallelized tools for large-scale data analysis, we look forward to having a substantial positive effect on these fields. Our approach allows for easy, user-friendly access to supercomputing resources. With that access, users can more easily submit large-scale jobs. By facilitating rapid large-scale analysis, we are able to help fulfill the significant demand created by the growing volume of data analysis needs in bioinformatics.
The system we have implemented is easily extended to incorporate similar highly scalable parallel tools for other domain sciences. By design, tools located on any computational resource can be made available easily and seamlessly through the science gateway. This combination makes our tools-plus-gateway approach powerfully scalable across both computational resources and science application domains.
This paper provides a model for the development of highly scalable parallel bioinformatics applications on HPC architectures along with an increase in availability and usability through science gateways. The model directly enhances large-scale data analysis and knowledge discovery capabilities. This work revealed the following points: (a) Significant time and skills are required to change the entire code of an application to a specific architecture, and avoiding such an endeavor is the best practice; (b) a superior approach is to wrap the code of the bioinformatics application without changing functionality, identify the intensive I/O part of the code, and optimize communications to scale the code well, thus generating accurate results; (c) even though parallel tools are available at supercomputing facilities, many researchers are reluctant to use them due to a lack of expertise operating on the command line; (d) for these reasons mentioned here, developing science gateways with easy access to the parallel tools is a better way to facilitate and encourage the use of supercomputing resources by biologists. This research will have a direct impact on life sciences data and the rate of knowledge discovery. Still, many issues exist--such as job scheduling, load balancing, and fault tolerance--that need to be addressed with science gateways. We want our users to have total control over the type and size of data that can be analyzed, and we are addressing those issues in the life sciences gateway, as well as developing automated workflows for large-scale data analysis.
Project name: PoPLAR
Project home page: http://poplar.nics.tennessee.edu/
Operating system(s): Web based / Platform independent
Programming language: Java
Other requirements: No
License: Contact author
Any restrictions to use by non-academics: None
BR is the research scientist at the Joint Institute for Computational Sciences (UT-Knoxville-Oak Ridge National Laboratory) who is developing parallel bioinformatics applications on HPC machines and next-generation architectures, along with collaborating with researchers from various universities on large-scale data analysis in life sciences. PG and CR are graduate research assistants from the electrical engineering and computer science department at UT-Knoxville who are participating in BR's projects, and both bring expertise from computer science areas to this research.
CIPRES Science Gateway
graphical user interface
Highly Scalable Parallel (tool suites)
input and output
Message Passing Interface
multiple sequence alignment
National Center for Biotechnology Information
National Science Foundation
Portal for Petascale Lifescience Applications and Research
Position Specific Iterative
Structured Query Language
translation look-aside buffer
University of Tennessee, Knoxville
Extensible Markup Language
Extreme Science and Engineering Discovery Environment
This research used resources at the Joint Institute for Computational Sciences; Extreme Science and Engineering Discovery Environment (XSEDE), funded by the National Science Foundation (NSF); and also supported in part by the NSF grants EPS-0919436 and OCI-1053575. We would like to thank Mark Miller and Terri Schwartz for guidance during code development, and also thank Suresh Marru for technical assistance in adding the PoPLAR science gateway to XSEDE.
Publication of this article is supported by the National Institute for Computational Sciences, XSEDE funded by NSF and also supported in part by NSF grants EPS-0919436 and OCI-1053575.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 9, 2013: Selected articles from the 8th International Symposium on Bioinformatics Research and Applications (ISBRA'12). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S9.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.