- Open Access
Closha: bioinformatics workflow system for the analysis of massive sequencing data
- GunHwan Ko†,
- Pan-Gyu Kim†,
- Jongcheol Yoon,
- Gukhee Han,
- Seong-Jin Park,
- Wangho Song and
- Byungwook Lee (corresponding author)
© The Author(s). 2018
- Published: 19 February 2018
While next-generation sequencing (NGS) costs have fallen in recent years, the cost and complexity of computation remain substantial obstacles to the use of NGS in biomedical care and genomic research. The rapidly increasing volume of data produced by new high-throughput methods has made data processing infeasible without automated pipelines. Integrating data and analytic resources into workflow systems addresses this problem by simplifying the task of data analysis.
To address this challenge, we developed a cloud-based workflow management system, Closha, to provide fast and cost-effective analysis of massive genomic data. We implemented complex workflows that make optimal use of high-performance computing clusters. Closha allows users to create multi-step analyses using drag and drop functionality and to modify the parameters of pipeline tools. Users can also import Galaxy pipelines into Closha. Closha is a hybrid system that lets users run both traditional analysis programs and MapReduce-based big-data analysis programs simultaneously in a single pipeline. Thus, the execution of analytic algorithms can be parallelized, speeding up the whole process. We also developed a high-speed data transmission solution, KoDS, to transmit large amounts of data at a fast rate. KoDS achieves a file transfer speed of up to 10 times that of normal FTP and HTTP. The computer hardware for Closha comprises 660 CPU cores and 800 TB of disk storage, enabling 500 jobs to run at the same time.
Closha is a scalable, cost-effective, and publicly available web service for large-scale genomic data analysis. Closha supports the reliable and highly scalable execution of sequencing analysis workflows in a fully automated manner, and provides all genomic scientists with a user-friendly interface for deriving accurate results from NGS platform data. The Closha cloud server is freely available at http://closha.kobic.re.kr/.
With the emergence of next-generation sequencing (NGS) technology in 2005, the field of genomics has been caught in a data deluge. Modern sequencing platforms are capable of sequencing approximately 5000 megabases per day. DNA sequencing is becoming faster and less expensive at a pace far outstripping Moore's law, which describes the rate at which computing becomes faster and less expensive. As a result of the increased efficiency and diminished cost of NGS, the demand for clinical and agricultural applications is rapidly increasing. In the bioinformatics community, acquiring massive sequencing data is always followed by large-scale computational analysis to process the data and obtain scientific insights. Therefore, investment in a sequencing instrument is normally accompanied by substantial investment in computer hardware, analysis pipelines, and bioinformatics experts to analyze the data.
When genomic datasets were small, they could be analyzed on personal computers in a few hours or perhaps overnight. However, this approach does not apply to large NGS datasets. Instead, researchers require high-performance computers and parallel algorithms to analyze their big genomic data in a timely manner. While high-performance computing is essential for data analysis, only a small number of biomedical research labs are equipped to make effective and successful use of parallel computers. Obstacles include the complexities inherent in managing large NGS datasets and in assembling and configuring multi-step genome sequencing pipelines, as well as the difficulties of adapting pipelines to process NGS data on parallel computers.
The difficulties of creating these complicated computational pipelines, installing and maintaining software packages, and obtaining sufficient computational resources tend to overwhelm bench biologists and prevent them from attempting to analyze their own genomic data. Despite the availability of a vast set of computational tools and methods for genomic data analysis, it is still challenging for a genomic researcher to organize these tools, integrate them into workable pipelines, find accessible computational platforms, configure the computing environment, and perform the actual analysis.
To address these challenges, the MapReduce model has been widely adopted to handle large data sets using parallel processing tools; Apache Hadoop is its most widely used open-source implementation for big-data batch processing. Cloud-based bioinformatics workflow platforms have also been proposed for genomic researchers. Scientific workflow systems such as Galaxy and Taverna offer simple web-based workflow toolkits and scalable computing environments to meet this challenge.
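As an illustration of the pattern (a minimal plain-Python sketch of the MapReduce model, not Hadoop itself), a computation is split into a map step that emits key-value pairs, a shuffle that groups them by key, and a reduce that aggregates each group; here the keys are nucleotide bases counted across a set of reads:

```python
from collections import defaultdict

def map_phase(read):
    # Map: emit a (base, 1) pair for every base in a read
    return [(base, 1) for base in read]

def shuffle(pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values (here, sum the counts)
    return {key: sum(values) for key, values in groups.items()}

reads = ["ACGT", "AACG", "TTGA"]
pairs = [pair for read in reads for pair in map_phase(read)]
counts = reduce_phase(shuffle(pairs))
```

Because map calls are independent and the reduce of each key depends only on that key's group, both phases can run in parallel across machines, which is what a framework such as Hadoop automates at scale.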
Such efforts have yielded significant insight into the technical requirements for leveraging cloud computing in the analysis of genomic data, but problems remain to be solved. Although many applications have been developed for the analysis of genomic data, they are either tools that run only on a MapReduce platform, such as Hadoop-BAM or Crossbow, or general-purpose (mainly Linux-based) programs such as Bowtie and BWA. It is crucial to integrate these two types of platform-based applications in a single pipeline. Transferring the data is another problem, as NGS genomic data is often too large to move conveniently to cloud computing platform services.
We developed an automatic workflow management system, Closha, to provide a pipeline-based analysis service for massive biological data, especially NGS genomic data. Closha was developed as a hybrid system that can run both Hadoop-based and general-purpose applications in a single analysis pipeline. We also developed a high-speed data transmission solution, KoDS, to transmit large amounts of data at a fast rate. Closha makes it simple to create multi-step analyses using drag and drop functionality: programs can be added and connected to each other so that the output of one program becomes the input of other programs. Our cloud-based workflow management system helps users run in-house pipelines or construct a series of steps in an organized way.
Goals of Closha
The following three objectives drive the development of Closha. First, Closha seeks to increase access to intricate computational analyses for all genomic researchers, including those with limited or no programming knowledge. Our web-based graphical user interface (GUI) makes it simple to do everything needed for relatively large data analyses. Second, the Closha GUI provides a workflow editor in which users can simply create automated, multi-step analysis pipelines using drag and drop. Here, workflows refer to structured procedures that help users construct a series of steps in an organized way. Each step is a specific parametrized action that receives input and produces output. The analysis pipelines on Closha are exactly reproducible, and all analysis parameters and inputs are permanently recorded. Lastly, Closha enables users to share their pipelines on the web.
Analysis pipelines are grouped into categories and can be searched on the pipeline panel. When a pipeline is selected, it is shown in the main window, where its parameters are set and the tool is executed. When a user executes a tool, its output datasets are added to the execution and history panel. The colors on the execution panel show the state of tool execution. Clicking on a dataset in the panel provides a wealth of information, including the tool and parameter settings used to create it.
Workflow editor (canvas)
The canvas is an interface for creating and modifying workflows (analysis pipelines) by arranging and connecting activities to drive processes. The canvas provides the working surface for creating new workflows or editing existing ones. Users can create custom workflows or use existing workflows on the screen. The canvas (Fig. 2) makes it simple to create multi-step analyses using drag and drop functionality. Using the canvas, existing and user-uploaded tools can be added and connected so that the output of one tool becomes the input of other tools. Tool parameters can be set in the parameter panel. Workflows enable the automation and repeated running of large analyses. Once created, workflows function as tools. They can be accessed and run from Closha’s main analysis interface.
Representing analysis pipelines as workflows
The workflows in the analysis pipelines are commonly depicted as directed acyclic graphs (DAGs), in which each vertex (a module or program) has a unique identifier and represents a task to be performed. Each task in a workflow can receive inputs and produce outputs, and the outputs of one task can be directed into another task as input. An edge (connector) between two vertices represents the channeling of an output from one task into another; edges thus determine the logical sequence of execution. A task can be executed once all of its inputs have been resolved.
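The execution rule above can be sketched with Kahn's topological ordering: a task becomes ready only once every edge feeding it has been consumed. This is an illustrative Python sketch with hypothetical task names, not Closha's internal scheduler:

```python
from collections import deque

def run_workflow(tasks, edges):
    """Execute a workflow DAG: each task runs only after all of its
    input tasks have produced output (Kahn's topological order)."""
    indegree = {t: 0 for t in tasks}
    downstream = {t: [] for t in tasks}
    for src, dst in edges:
        indegree[dst] += 1
        downstream[src].append(dst)
    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)          # the task would execute here
        for nxt in downstream[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:  # all inputs now resolved
                ready.append(nxt)
    if len(order) != len(tasks):
        raise ValueError("cycle detected: workflow is not a DAG")
    return order

# Hypothetical four-step pipeline: align -> assemble -> merge -> diff
order = run_workflow(
    ["align", "assemble", "merge", "diff"],
    [("align", "assemble"), ("assemble", "merge"), ("merge", "diff")],
)
```

Independent branches of the DAG (tasks that end up in the ready queue together) can be dispatched concurrently, which is how a workflow engine parallelizes a pipeline.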
Uploading data to Closha
We implemented a service-oriented architecture, a hybrid system, to allow arbitrary tools to be described as services. The hybrid system provides access to traditional applications on a cloud infrastructure, which enables users to use both the MapReduce tools and the traditional programs in a single pipeline simultaneously. Thus, the execution of analytical algorithms can be parallelized, speeding up the whole process.
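A hybrid pipeline of this kind can be pictured as each step carrying a backend tag that decides how it is launched. The step definitions, program names, and command strings below are hypothetical illustrations, not Closha's internal format:

```python
# Hypothetical dispatch: each pipeline step declares whether it runs as a
# MapReduce job on the cluster or as a traditional command-line program.
def dispatch(step):
    if step["backend"] == "mapreduce":
        # Submit to the Hadoop cluster
        return f"hadoop jar {step['jar']} {step['args']}"
    # Run as an ordinary Linux program
    return f"{step['program']} {step['args']}"

pipeline = [
    {"backend": "general",   "program": "bwa mem", "args": "ref.fa reads.fq"},
    {"backend": "mapreduce", "jar": "sort.jar",    "args": "aligned.bam"},
]
commands = [dispatch(step) for step in pipeline]
```

Keeping the backend choice as per-step metadata is what allows the two kinds of tools to coexist in one DAG: the workflow engine wires inputs and outputs identically regardless of where each step executes.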
Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. A system is considered scalable if its total output increases, in proportion to the resources (typically hardware) added, under an increased load. Scalability is one of the most attractive benefits of cloud computing and provides a useful safety net when a user's needs and demands change. The resource manager and the job controller on Closha elastically control scalability by increasing or decreasing the allocated resources.
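A hypothetical policy for such elastic control (an illustrative sketch, not Closha's actual resource manager) might size the worker pool from the length of the job queue and clamp it to the cluster's limits:

```python
def scale_workers(queued_jobs, jobs_per_worker=4, min_workers=1, max_workers=100):
    """Return the number of workers needed for the current queue,
    clamped to the cluster limits (illustrative policy only)."""
    needed = -(-queued_jobs // jobs_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))
```

Run periodically against the queue, a rule like this grows the pool under load and shrinks it when the queue drains, which is the elastic behavior described above.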
As of October 1st, approximately 200 analysis tools were installed on Closha, and 20 analysis pipelines were available for the analysis of exome, RNA-Seq, and ChIP-Seq data, among others. Closha has two types of pipelines: registered and new. Users can apply a registered pipeline suitable for their genomic data by selecting it in the Closha analysis pipeline list. If users want to create a new analysis pipeline, they can build their own either from scratch or by modifying a registered pipeline with installed or user-defined tools.
The pipeline includes five analysis tools: TopHat, Cufflinks, Cuffmerge, Cuffdiff, and limma voom. TopHat is a fast splice junction mapper that aligns RNA-Seq reads to large genomes and analyzes the mapping results to identify splice junctions between exons; it internally uses the Bowtie tool, an ultra-high-throughput short read aligner. Cufflinks assembles these alignments into a parsimonious set of transcripts and then estimates the relative abundances of these transcripts. The main purpose of Cuffmerge is to merge several Cufflinks assemblies, making it easier to produce an assembly GTF file suitable for use with Cuffdiff. Cuffdiff is then used to find significant changes in transcript expression, splicing, and promoter use. Finally, voom robustly estimates the mean-variance relationship and generates a precision weight for each individual normalized observation; it can be used to call differentially expressed genes (DEGs) from the transcript expression levels. Figure 4b depicts the implemented RNA-Seq pipeline on the Closha canvas.
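The chaining of these five tools, where each step consumes the previous step's output, can be sketched as command templates. The directory layout, reference index, sample names, and the voom script below are placeholders, not the pipeline's actual configuration:

```python
# Hypothetical command templates for the five RNA-Seq pipeline steps;
# {out} is the project output directory, {index} and {reads} are placeholders.
steps = [
    ("tophat",    "tophat -o {out}/tophat {index} {reads}"),
    ("cufflinks", "cufflinks -o {out}/cufflinks {out}/tophat/accepted_hits.bam"),
    ("cuffmerge", "cuffmerge -o {out}/cuffmerge assemblies.txt"),
    ("cuffdiff",  "cuffdiff -o {out}/cuffdiff {out}/cuffmerge/merged.gtf sample1.bam sample2.bam"),
    ("voom",      "Rscript voom_deg.R {out}/cuffdiff"),
]
commands = [template.format(out="results", index="genome_index", reads="sample.fastq")
            for _, template in steps]
```

Note how the wiring mirrors the canvas: TopHat's alignment output feeds Cufflinks, the merged GTF from Cuffmerge feeds Cuffdiff, and the Cuffdiff results feed the voom DEG step.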
[Table] Execution time of each program of Closha and Galaxy in the RNA-Seq analysis: per-step times of 1 h 14 min, 2 h 36 min, 3 h 4 min, and 1 h 6 min (for the combined Cuffdiff and voom step), with total running times of 3 h 44 min and 6 h 11 min for the two systems.
[Table] Running time of multiple jobs: number of jobs vs. running time of each job.
Creating a new pipeline
Closha allows users to create their own pipelines to analyze their own data on the canvas. To create a new analysis pipeline, users click the 'New Pipeline' button in the top menu of Closha, enter the name and description of the pipeline, and select an analysis pipeline type. Immediately after creating a pipeline with 'new analysis pipeline design' selected as the project type, users have only the [Start] and [End] modules on the canvas. Users can drag and drop the desired analysis programs from the list of analysis programs on the right of the canvas. After positioning an analysis program on the canvas, placing the mouse over the edge of its icon creates a connection mark that can be drawn to another module. Starting from the mark, the connector must be dragged until the icon of the next analysis program to be connected turns translucent. Users connect the start module, the analysis programs, and the end module in this way to perform the analysis.
Then, users can set the parameter values by clicking the 'Set Parameters' button on the toolbar before executing the pipeline project. On the creation of an initial project, default parameter values are automatically assigned. Users can change the parameter values in accordance with the conditions required to set and analyze their input data. To connect user files to Closha, the user clicks the 'File Selection' icon in the field, which opens a window for selecting an input file; the user then chooses personal or common-use data and the desired file from the file list. The output file path is automatically set to a subpath of the project when the input data is specified. Finally, the analysis pipeline is executed, and a message indicates that the analysis has started. The status of the project is displayed in real time in one of three modes: Complete, Execute, and Wait.
The Closha computing service is an attractive, efficient, and potentially cost-effective alternative for the analysis of large genomic datasets. Closha offers a dynamic, economical, and versatile solution for large-scale computational analysis. Our work on genomic data demonstrates that Closha provides a scalable, robust, and efficient solution to the ever-increasing demand for genomic sequence analysis. Closha allows genomic researchers without informatics or programming expertise to perform complex large-scale analyses with only a web browser. Its potential for computing with NGS genomic data could eventually revolutionize life science and medical informatics.
We developed a cloud-based workflow management system, Closha, to provide fast and cost-effective analysis of massive genomic data. We implemented complex workflows that make optimal use of high-performance computing clusters. Closha allows users to create multi-step analyses using drag and drop functionality and to modify the parameters of pipeline tools. We also developed a high-speed data transmission solution, KoDS, to transmit large amounts of data at a fast rate; KoDS achieves a file transfer speed of up to 10 times that of normal FTP and HTTP. The computer hardware for Closha comprises 660 CPU cores and 800 TB of disk storage, enabling 500 jobs to run at the same time. Closha is a scalable, cost-effective, and publicly available web service for large-scale genomic data analysis. It supports the reliable and highly scalable execution of sequencing analysis workflows in a fully automated manner and provides all genomic scientists with a user-friendly interface for deriving accurate results from NGS platform data.
The authors would like to thank the anonymous reviewers and Closha users for their time and their valuable comments.
Publication costs were funded by the KRIBB Research Initiative Program and the Korean Ministry of Science and Technology (under grant numbers 2010–0029345 and 2014M3C9A3064681).
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 1, 2018: Proceedings of the 28th International Conference on Genome Informatics: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-1.
GK and PK launched the Closha project and developed the cloud computing service. JY, GH, SP, and WS were responsible for development of the web interface and the back-end cloud system. BL supervised the project. GK and BL wrote the draft of the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Souilmi Y, Lancaster AK, Jung JY, Rizzo E, Hawkins JB, Powles R, Amzazi S, Ghazal H, Tonellato PJ, Wall DP. Scalable and cost-effective NGS genotyping in the cloud. BMC Med Genet. 2015;8:64.
- Afgan E, Baker D, van den Beek M, Blankenberg D, Bouvier D, Cech M, Chilton J, Clements D, Coraor N, Eberhard C, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 2016;44(W1):W3–W10.
- de la Garza L, Veit J, Szolek A, Rottig M, Aiche S, Gesing S, Reinert K, Kohlbacher O. From the desktop to the grid: scalable bioinformatics via workflow conversion. BMC Bioinformatics. 2016;17:127.
- Huang Z, Rustagi N, Veeraraghavan N, Carroll A, Gibbs R, Boerwinkle E, Venkata MG, Yu F. A hybrid computational strategy to address WGS variant analysis in >5000 samples. BMC Bioinformatics. 2016;17(1):361.
- Goecks J, Eberhard C, Too T, Galaxy T, Nekrutenko A, Taylor J. Web-based visual analysis for high-throughput genomics. BMC Genomics. 2013;14:397.
- Langdon WB. Performance of genetic programming optimised Bowtie2 on genome comparison and analytic testing (GCAT) benchmarks. BioData Min. 2015;8(1):1.
- Yazar S, Gooden GE, Mackey DA, Hewitt AW. Benchmarking undedicated cloud computing providers for analysis of genomic datasets. PLoS One. 2014;9(9):e108490.
- Abouelhoda M, Issa SA, Ghanem M. Tavaxy: integrating Taverna and Galaxy workflows with cloud computing support. BMC Bioinformatics. 2012;13:77.
- O'Driscoll A, Daugelaite J, Sleator RD. 'Big data', Hadoop and cloud computing in genomics. J Biomed Inform. 2013;46(5):774–81.
- Hiltemann S, Mei H, de Hollander M, Palli I, van der Spek P, Jenster G, Stubbs A. CGtag: complete genomics toolkit and annotation in a cloud-based Galaxy. GigaScience. 2014;3(1):1.
- Goecks J, Nekrutenko A, Taylor J, Galaxy T. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11(8):R86.
- Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, et al. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 2004;20(17):3045–54.
- Niemenmaa M, Kallio A, Schumacher A, Klemela P, Korpelainen E, Heljanko K. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics. 2012;28(6):876–7.
- Zhao S, Prenger K, Smith L, Messina T, Fan H, Jaeger E, Stephens S. Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing. BMC Genomics. 2013;14:425.
- Gurtowski J, Schatz MC, Langmead B. Genotyping in the cloud with Crossbow. Curr Protoc Bioinformatics. 2012;Chapter 15:Unit 15.3.
- Nagasaki H, Mochizuki T, Kodama Y, Saruhashi S, Morizaki S, Sugawara H, Ohyanagi H, Kurata N, Okubo K, Takagi T, et al. DDBJ read annotation pipeline: a cloud computing-based pipeline for high-throughput analysis of next-generation sequencing data. DNA Res. 2013;20(4):383–90.
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.
- Law CW, Alhamdoosh M, Su S, Smyth GK, Ritchie ME. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Res. 2016;5:1408.