SUSHI: an exquisite recipe for fully documented, reproducible and reusable NGS data analysis
© Hatakeyama et al. 2016
Received: 20 February 2016
Accepted: 26 May 2016
Published: 2 June 2016
Next generation sequencing (NGS) produces massive datasets consisting of billions of reads and up to thousands of samples. Subsequent bioinformatic analysis is typically done with the help of open source tools, where each application performs a single step towards the final result. This situation leaves bioinformaticians with the tasks of combining the tools, managing the data files and meta-information, documenting the analysis, and ensuring reproducibility.
We present SUSHI, an agile data analysis framework that relieves bioinformaticians from the administrative challenges of their data analysis. SUSHI lets users build reproducible data analysis workflows from individual applications and manages the input data, the parameters, meta-information with user-driven semantics, and the job scripts. As distinguishing features, SUSHI provides an expert command line interface as well as a convenient web interface to run bioinformatics tools. SUSHI datasets are self-contained and self-documented on the file system. This makes them fully reproducible and ready to be shared. With the associated meta-information being formatted as plain text tables, the datasets can be readily further analyzed and interpreted outside SUSHI.
SUSHI provides an exquisite recipe for analysing NGS data. By following the SUSHI recipe, users find data analysis straightforward, and SUSHI takes care of documentation and administration tasks. Thus, users can fully dedicate their time to the analysis itself. SUSHI is suitable for use by bioinformaticians as well as life science researchers. It is targeted at, but by no means constrained to, NGS data analysis. Our SUSHI instance is in productive use and has served as data analysis interface for more than 1000 data analysis projects. SUSHI source code as well as a demo server are freely available.
Keywords: Data analysis framework, Reproducible research, Meta-level system design
Today’s bioinformatics faces the practical challenge of analyzing massive and diverse data in a well documented and reproducible fashion. The situation is particularly challenging in the area of NGS research, where state-of-the-art algorithms are frequently available as standalone tools and where a complete data analysis consists of many individual data processing and analysis steps. The considerations associated with conducting such a data analysis in a research environment have been discussed by W. S. Noble, and guidelines as well as an example strategy for organizing computational data analysis have been given. According to Noble, a key principle is to record every operation such that reproducibility is ensured.
In this paper, we present SUSHI (Support Users for SHell-script Integration), a new approach to bioinformatics analysis that is centered on reusability, reproducibility and scalability. SUSHI produces analysis results as directories that are fully self-contained and hold all the information needed to reproduce them. Specifically, we document all parameters, input data, commands executed, as well as the versions of the tools and the reference data used. Additionally, we store meta-information on the experimental data together with the result files, so that those can be interpreted and further analyzed by other tools independently from SUSHI. This holds even if the analysis directory is transferred to collaborators with a different computing environment. SUSHI is extendable and we have put special emphasis on the simplicity of adding new software applications. A bioinformatician can define them within a single file and does not need special programming skills. SUSHI natively offers a command line interface as well as a web interface to run data analysis steps. Altogether, SUSHI lets bioinformaticians efficiently build analysis pipelines and ensures that analysis results are ready-to-be-shared and reproducible.
Various types of data analysis frameworks have already been implemented. They can be essentially divided into web-based frameworks and scripting frameworks. Examples of web-based frameworks are Galaxy, Chipster, GeneProf and GenePattern. They let users run individual steps or entire pipelines on a remote compute system with the framework keeping track of the executed analysis. Scripting frameworks like bpipe, Ruffus, nestly, NGSANE, Makeflow, and Snakemake let users build bioinformatics pipelines in a command line fashion. Given the different types of user interactions, the former solutions are more targeted at experienced biologists or application-oriented bioinformaticians, while the latter address the needs of bioinformaticians who are more inclined to programming and high-throughput analysis of many datasets. However, there is no system as yet that natively offers both interfaces. Additionally, none of the existing frameworks puts an emphasis on having a human-readable and portable file-based representation of the meta-information and associated data.
SUSHI data sets and applications
A SUSHI application requires as input both a set of parameters and a DataSet object. This means that applications do not take bare data files as direct input. Instead, SUSHI applications take as input the DataSet meta-information object. The DataSet object holds, next to the data files, the meta-information necessary to process and interpret the data files. Based on its input, a SUSHI application first generates 1) the necessary job script(s), 2) a file representation of the parameters, and 3) the DataSet for the output data (Fig. 1 Step 1). The actual result data file(s) are generated by the job script(s) (Fig. 1 Step 2). The columns of the output DataSet again hold the meta-information, which now additionally includes the parameters of the executed analysis if they are relevant for further analysis or interpretation. The set of characteristics that is added to the annotation columns is defined and generated by the SUSHI application. The SUSHI framework itself does not require any specific annotation columns. Thus, the semantics of the DataSet columns are determined by the SUSHI applications (described in detail in the next section).
Every column of meta-information has a unique header that identifies the content, and optional tags that characterize the information type in the column. Tags are represented as comma-separated strings within square brackets in the column headers. Currently supported tags are File, Link, Factor, and Characteristics. Depending on the tags, the SUSHI framework provides appropriate actions for the corresponding columns. The File tag is reserved for actual file paths, and SUSHI checks if the file actually exists. If a column has a Link tag, SUSHI will automatically add a hyperlink to the data. Finally, the Factor column data will be used to group samples according to experimental factors, which is typically required in a differential gene expression analysis.
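The tag convention described above can be illustrated with a short Ruby sketch. It is not taken from the SUSHI source code: the helper names parse_header and group_by_factor and the column name "Condition" are hypothetical and chosen only for this illustration. The sketch splits a column header into its name and its tags, and groups sample names by the first column that carries a Factor tag.

```ruby
# Split a column header such as "Read1 [File]" or
# "Condition [Factor,Characteristics]" into its name and its list of tags.
# Hypothetical helper; it only follows the header syntax described in the text.
def parse_header(header)
  match = header.strip.match(/\A(?<name>.*?)\s*(?:\[(?<tags>[^\]]*)\])?\z/)
  tags = match[:tags] ? match[:tags].split(',').map(&:strip).reject(&:empty?) : []
  { name: match[:name], tags: tags }
end

# Group sample names by the value of the first column tagged "Factor".
# rows is an array of hashes keyed by the full column header.
def group_by_factor(rows)
  return {} if rows.empty?
  factor_header = rows.first.keys.find { |h| parse_header(h)[:tags].include?('Factor') }
  return {} unless factor_header
  groups = Hash.new { |hash, key| hash[key] = [] }
  rows.each { |row| groups[row[factor_header]] << row['Name'] }
  groups
end
```

A header without square brackets simply yields an empty tag list, so untagged columns such as Name or Species pass through unchanged.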
Example DataSet holding RNA-seq reads
Name: the name of the sample measured.
Read1 [File]: path to the file holding the reads; if reads are paired-end, this must be the first read.
Adapter1: potentially contaminating adapter sequence at the 3′-end of read 1.
Read2 [File]: path to the second read for paired-end data (only for paired-end data).
Adapter2: potentially contaminating adapter sequence at the 3′-end of read 2 (only for paired-end data).
Species: the species of the sample.
StrandMode: specifies whether the library preparation protocol preserved strand information.
Enrichment Kit: the kit employed to enrich the input material (e.g. poly-A selection kit)
Read Count: the number of reads in the file.
A sample DataSet
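Because the meta-information is kept as a plain text table, a DataSet with columns like those listed above can be read with a few lines of Ruby. The sketch below is an illustration only: the sample names, file paths and read counts are made up, the tab-separated layout is an assumption, and parse_dataset and missing_files are hypothetical helpers rather than SUSHI functions. The second helper mimics the existence check that the text describes for File-tagged columns.

```ruby
# Hypothetical single-end DataSet as a tab-separated table; all values are made up.
DATASET_TSV = <<~TSV
  Name\tRead1 [File]\tSpecies\tRead Count
  sampleA\treads/sampleA_R1.fastq.gz\tHomo sapiens\t1000
  sampleB\treads/sampleB_R1.fastq.gz\tHomo sapiens\t1200
TSV

# Turn the table into one hash per sample, keyed by the full column header.
def parse_dataset(tsv)
  lines = tsv.lines.map(&:chomp).reject(&:empty?)
  headers = lines.first.split("\t")
  lines.drop(1).map { |line| headers.zip(line.split("\t")).to_h }
end

# Return every path from a File-tagged column that does not exist below base_dir.
def missing_files(rows, base_dir)
  return [] if rows.empty?
  file_headers = rows.first.keys.select { |h| h =~ /\[[^\]]*\bFile\b[^\]]*\]\s*\z/ }
  rows.flat_map { |row| file_headers.map { |h| row[h] } }
      .reject { |path| File.exist?(File.join(base_dir, path)) }
end
```

Calling parse_dataset(DATASET_TSV) returns two hashes, one per sample, and missing_files reports the read files that cannot be found relative to the given base directory.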
Example SUSHI application performing a FastQC report
A common task is to generate a FastQC report  for each sample in a read data set. If the fastqc package is installed one would run e.g.
The example code shows the essential features of a SUSHI application (see also Additional file 1). The @required_columns tells the SUSHI framework which columns a DataSet must have so that FastqcMinimal is applicable. In Fig. 2a, all applications that are compatible with the example reads data set are shown at the bottom, including the FastqcMinimal application. The @params['cores'] defines the number of cores to be used for multi-threading as a parameter with default value 4. This parameter is automatically turned into an input field in the web interface (see Fig. 2b). The code also defines with the method next_dataset the columns and content for the resulting DataSet. Finally, the method commands defines the command to be executed.
The SUSHI framework automatically performs administrative tasks such as putting the resulting file in the correct directory and managing the log files. The full result directory is available as Additional file 2. A second example, for TopHat mapping, is given in Additional files 3 and 4. Both examples are kept minimal for illustrative purposes. In a real-world application one would define additional parameters, support paired-end reads, and so on.
A list of all SUSHI applications that are in use at the Functional Genomics Center Zurich is available as Additional file 5 or on the SUSHI demo server: http://fgcz-sushi-demo.uzh.ch/sushi_application.
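The features named in this paragraph can be sketched as follows. This is not the code from Additional file 1. Only @required_columns, @params['cores'], next_dataset and commands are taken from the text; the stand-in super class SketchApp, the applicable? check, the result_dir default and the way a single sample row is passed in are assumptions made so that the example runs on its own. The fastqc options -t (threads) and -o (output directory) are standard FastQC options.

```ruby
# Stand-in for the SUSHI application super class (hypothetical, not the real API).
class SketchApp
  attr_reader :required_columns, :params
  attr_accessor :dataset, :result_dir

  def initialize
    @required_columns = []
    @params = {}
    @result_dir = 'results' # assumed default, for illustration only
  end

  # An application is applicable when the DataSet offers every required column.
  # Tags in square brackets are ignored for the comparison.
  def applicable?(headers)
    names = headers.map { |h| h.sub(/\s*\[.*\]\s*\z/, '') }
    (@required_columns - names).empty?
  end
end

# Minimal FastQC-style application in the spirit of FastqcMinimal.
class FastqcMinimalSketch < SketchApp
  def initialize
    super
    @required_columns = ['Name', 'Read1']
    @params['cores'] = 4 # default number of threads
  end

  # Columns and content of the resulting DataSet for one sample.
  def next_dataset
    report = File.basename(@dataset['Read1 [File]']).sub(/\.fastq(\.gz)?\z/, '_fastqc.zip')
    { 'Name' => @dataset['Name'], 'Report [File]' => File.join(@result_dir, report) }
  end

  # Shell command that goes into the job script for one sample.
  def commands
    "fastqc -t #{@params['cores']} -o . #{@dataset['Read1 [File]']}"
  end
end
```

In this sketch the application only declares what it needs and what it produces; everything else, such as deciding whether it fits a given DataSet, is left to the super class.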
SUSHI applications are meta-process objects
Must be runnable from command line.
Must not require interactive input at run time.
Result must be representable as a file, or a set of files in a directory.
The framework takes care of job execution, data file placement and cleanup of temporary data. Technically, a SUSHI application is implemented in a single Ruby file as a single Ruby class that inherits from the SUSHI application super class using the template method design pattern (see Additional file 6).
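The template method design pattern mentioned here can be sketched in plain Ruby. The example is an illustration under stated assumptions, not the SUSHI super class from Additional file 6: the class names TemplateApp and LineCountSketch, the file names parameters.tsv and dataset.tsv, and the column "Input [File]" are hypothetical. The super class fixes the sequence of administrative steps (parameter file, output DataSet, one job script per sample), while the subclass only supplies the two hooks commands and next_dataset.

```ruby
require 'fileutils'

# Hypothetical stand-in for an application super class.
class TemplateApp
  def initialize(params, dataset_rows, result_dir)
    @params = params
    @dataset_rows = dataset_rows
    @result_dir = result_dir
  end

  # Template method: the fixed order of steps, identical for every application.
  # Returns the paths of the generated job scripts.
  def run
    FileUtils.mkdir_p(@result_dir)
    write_tsv('parameters.tsv', @params.map { |k, v| { 'parameter' => k, 'value' => v } })
    write_tsv('dataset.tsv', @dataset_rows.map { |row| next_dataset(row) })
    @dataset_rows.map do |row|
      path = File.join(@result_dir, "#{row['Name']}.sh")
      File.write(path, "#!/bin/bash\nset -e\n#{commands(row)}\n")
      path
    end
  end

  # Hooks that every concrete application must provide.
  def commands(_row)
    raise NotImplementedError, 'subclass must define commands'
  end

  def next_dataset(_row)
    raise NotImplementedError, 'subclass must define next_dataset'
  end

  private

  # Write an array of hashes as a tab-separated table with a header line.
  def write_tsv(name, rows)
    return if rows.empty?
    lines = [rows.first.keys.join("\t")] + rows.map { |r| r.values.join("\t") }
    File.write(File.join(@result_dir, name), lines.join("\n") + "\n")
  end
end

# Concrete application: counts the lines of each input file.
class LineCountSketch < TemplateApp
  def commands(row)
    "wc -l #{row['Input [File]']} > #{row['Name']}.count.txt"
  end

  def next_dataset(row)
    { 'Name' => row['Name'],
      'Count [File]' => File.join(@result_dir, "#{row['Name']}.count.txt") }
  end
end
```

The generated scripts contain only shell commands, so in this sketch they can be executed without the Ruby classes that produced them.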
SUSHI architecture follows meta-level design
SUSHI consists of three modules that are implemented in Ruby: 1) The SUSHI application module, as discussed above. 2) A Workflow Manager that performs job execution either on a local host or on a grid or cloud environment. 3) The SUSHI server that relies on Ruby on Rails and provides the web front-end. The SUSHI server delegates the core function of generating a job script to the SUSHI application, and the SUSHI application communicates with the Workflow Manager for job submission. SUSHI relies heavily on Ruby meta-programming in that the executable code is dynamically generated at run time. During the code execution, all graphical components are dynamically generated based on the SUSHI application. This process follows the basic principles of Ruby on Rails: DRY (Don’t Repeat Yourself) and CvC (Convention over Configuration). That is, the SUSHI application serves as the data source for the generation of the graphical components of the GUI. As a consequence, every data analysis that is implemented as a SUSHI application is directly available in the web front-end of the SUSHI server. The implementation of a SUSHI application does not need any web development knowledge, so that the bioinformatician can focus on data analysis aspects.
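The idea that the application itself is the data source for the GUI can be illustrated with a small Ruby sketch. It does not reproduce the Rails views of the SUSHI server: the function form_fields, the mapping from default values to HTML input types, and all parameter names other than cores are assumptions for this illustration. Each entry of a parameter hash is turned into one labelled form field, so adding a parameter to an application would add a field without any web development.

```ruby
HTML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Escape the characters that are special in HTML text and attribute values.
def h(text)
  text.to_s.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Derive one labelled HTML form field per parameter from its default value:
# arrays become drop-down lists, booleans check boxes, numbers numeric inputs,
# and everything else a text input.
def form_fields(params)
  params.map do |name, default|
    id = h(name)
    label = %(<label for="#{id}">#{id}</label>)
    field =
      case default
      when Array
        options = default.map { |o| %(<option>#{h(o)}</option>) }.join
        %(<select id="#{id}" name="#{id}">#{options}</select>)
      when true, false
        %(<input type="checkbox" id="#{id}" name="#{id}"#{default ? ' checked' : ''}>)
      when Numeric
        %(<input type="number" id="#{id}" name="#{id}" value="#{default}">)
      else
        %(<input type="text" id="#{id}" name="#{id}" value="#{h(default)}">)
      end
    label + field
  end
end
```

For example, form_fields('cores' => 4) yields a single numeric input pre-filled with 4, which corresponds to the kind of input field the text describes for the cores parameter.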
The SUSHI server is implemented in Ruby on Rails and can be installed in one step including all dependencies. The default installation uses the WEBrick web server application and SQLite3 database management system. Alternatively, SUSHI can be configured to run using an Apache web-server and a MySQL backend. For more details, please refer to README.rdoc in the source repository, https://github.com/uzh/sushi.
Results and discussion
Use case RNA-seq data analysis
Comparison with existing systems
Various types of workflow management systems are compared
Chipster is designed as a GUI application and does not have a CLI or batch processing mode, so that a user must run applications one by one manually. GenePattern and Taverna provide both GUI and CLI. These systems are designed for GUI usage but also provide command-line access through special clients with an application programming interface (API). The SUSHI application model, on the other hand, is designed for command line usage, and the web front-end is auto-generated using the Ruby on Rails meta-process.
In terms of reproducibility and documentation, CLI systems tend to have the entire analysis information in a single text file, while GUI systems come with a relational database management system (RDBMS) that provides this and additional functionality. Generally, the framework itself is required to run an application or workflow again to reproduce results. In contrast, SUSHI provides the inherent advantages of GUI systems but still generates a set of human-readable shell script files that contain all the processing information and can be run independently of SUSHI (see the scripts in Additional files 1 and 3). In essence, SUSHI conveniently ensures full documentation and high reproducibility but is not needed to reproduce the analysis results.
Representation of meta-information
Accurate and high quality meta-information is necessary for the interoperability and integration of different data sources. However, even 10 years after the minimum information about a microarray experiment (MIAME) guidelines were established, there is no consensus on how this should be implemented. This can be seen from the fact that different repositories and consortia use different implementations. Examples are GEO SOFT, Sequence Read Archive (SRA) XML, ENCODE, modENCODE, and ISA-TAB. Given this situation, we decided that SUSHI should not implement its own constraints on meta-information. It is up to the data providers to decide, for example, which meta-information fields are provided and whether they are filled with free text, controlled vocabularies or terms from an ontology. SUSHI simply makes sure that the meta-information is preserved and that the analysis results are linked to the source data and accompanying meta-information. It goes without saying that we always encourage the use of controlled vocabularies and ontologies wherever possible.
Productive use and user acceptance
We have introduced SUSHI step-wise at the Functional Genomics Center Zurich. Initially, we supported only the most frequently used analyses such as RNA-seq and SNP calling, and subsequently we supported other applications like small RNA-seq, ChIP-seq and de novo assembly. SUSHI was readily adopted by the bioinformaticians because they had an immediate direct net benefit without a compromise on the flexibility in terms of parameter choices. SUSHI development was driven and shaped by direct user feedback, which also broadened the acceptance.
Advanced separation of SUSHI concerns
The separation of concerns (SoC) is a software design principle in computer science. It is now widely accepted and adopted in computer systems and software design, for example in object-oriented programming and modular software design. The Model-View-Controller (MVC) design pattern is a typical example of separation of concerns for better software maintainability. The meta-level design is one type of advanced separation of concerns (ASoC) that goes beyond object orientation and can be seen in recent software paradigms such as generic programming, generative programming, and meta-programming. For example, the reflection architecture in Pattern-Oriented Software Architecture (POSA) separates a system into a meta-level and a base-level; controlling the meta-level triggers a change at the base-level, which actually provides a service to the user.
In common NGS data analysis, several independent software applications are combined and chained to produce a final result. This situation has two main aspects: 1) which applications are used, and 2) how they are actually used. SUSHI separates these aspects (concerns) into a meta-process with meta-information (SUSHI application and DataSet object) and a base-process (shell scripts that do the data analysis). This separation of concerns results in the loose coupling of SUSHI into its system level and user application level. This improves the independence of SUSHI applications and shell scripts.
SUSHI targets both biologists and bioinformaticians as users. Analyzing data with SUSHI does not require programming skills, while adding new SUSHI applications requires only basic experience in writing scripts using the syntax of shell, R, Python, Ruby, or similar. SUSHI is particularly attractive to data analysis experts and bioinformaticians. SUSHI relieves users from administrative burdens and aids documentation and data organization. The full flexibility of the underlying tools stays untouched and can be directly accessed. Additionally, the fact that datasets are fully defined on the file system lets users prototype new workflows without the need to integrate those in the SUSHI instance. Finally, experts can automate data analysis tasks with the command line interface and are not limited to the web interface.
The design of the SUSHI system is driven by the idea of having analysis results fully defined and self-contained on the file system. In fact, if SUSHI is shut down, all the results can still be used and interpreted. All the meta-information is available in human readable tabular formats and all job scripts are contained with no back reference to the SUSHI framework. SUSHI provides no Laboratory Information Management System (LIMS) or computing functionality. Instead, through its open architecture it readily integrates with existing LIMS systems and computing resources.
In one solution, SUSHI provides at the same time fully documented, high level NGS analysis tools to biologists and an easy to administer, reproducible approach for large and complicated NGS data to bioinformaticians. This is mostly achieved through the meta-level system design. Bioinformaticians are freed from the tedious tasks of managing software applications and documentation, and they can focus more on method development and on data analysis itself. The separation of the Workflow Manager from the SUSHI server makes the adaptation to any kind of computing facility easy and leaves the possibility to scale up.
The meta-level system design yields a simple but powerful framework that imposes no data representation of its own: SUSHI fully delegates the definition of dataset semantics to the user. The SUSHI system itself only defines how meta-information is used in a SUSHI application. This makes datasets portable, lowers the barrier to reusing data, and improves the human readability of dataset meta-information. The meta-level system design decouples the meta-process from the base application process, which increases the degrees of freedom on the user side and contributes to the flexibility and scalability of the system.
Availability and requirements
Project name: SUSHI.
Project homepage: https://github.com/uzh/sushi.
Operating system(s): Platform independent, but we recommend a Unix-like system such as Ubuntu Linux or Mac OS X.
Programming language: Ruby (>= 1.9.3).
Other requirements: Ruby on Rails (>= 3.2.9, < 4.0).
For the RubyGems library dependencies, please refer to Additional file 8, Gemfile.lock, which is also included in the Git repository.
API, application programming interface; ASoC, advanced separation of concerns; ChIP-seq, chromatin immunoprecipitation and next-generation DNA sequencing; CLI, command line interface; CvC, convention over configuration; DRY, don’t repeat yourself; DSL, domain specific language; GUI, graphical user interface; LIMS, laboratory information management system; MIAME, minimum information about a microarray experiment; MVC, model-view-controller; NGS, next generation sequencing; RDBMS, relational database management system; RNA-seq, RNA sequencing; SoC, separation of concerns; SRA, sequence read archive.
We would like to thank the members of the IT and Genomics/Transcriptomics groups at the Functional Genomics Center Zurich for useful discussions and suggestions. Additionally, we would like to thank all users for their valuable feedback.
No funding was obtained for this study.
Availability of data and material
Source code: https://github.com/uzh/sushi.
Public demo installation: http://fgcz-sushi-demo.uzh.ch.
MH and HR mainly wrote the manuscript, and all authors reviewed and revised the manuscript. MH developed the core part of the SUSHI system, namely the SUSHI server, the Workflow Manager, and the SUSHI application super class. HR, LO, GR, and WQ validated the system and developed SUSHI applications. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Noble WS. A quick guide to organizing computational biology projects. PLoS Computational Biology. 2009;5:e1000424.
- Goecks J, Nekrutenko A, Taylor J, Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology. 2010;11:R86. doi:10.1186/gb-2010-11-8-r86.
- Fisch KM, Meißner T, Gioia L, Ducom J-C, Carland TM, Loguercio S, Su AI. Omics Pipe: a community-based framework for reproducible multi-omics data analysis. Bioinformatics (Oxford, England). 2015;31:1724-8. doi:10.1093/bioinformatics/btv061.
- Halbritter F, Vaidya HJ, Tomlinson SR. GeneProf: analysis of high-throughput sequencing experiments. Nature Methods. 2012;9:7-8. doi:10.1038/nmeth.1809.
- Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP. GenePattern 2.0. Nature Genetics. 2006;38:500-1. doi:10.1038/ng0506-500.
- Sadedin SP, Pope B, Oshlack A. Bpipe: a tool for running and managing bioinformatics pipelines. Bioinformatics (Oxford, England). 2012;28:1525-6. doi:10.1093/bioinformatics/bts167.
- Goodstadt L. Ruffus: a lightweight Python library for computational pipelines. Bioinformatics (Oxford, England). 2010;26:2778-9. doi:10.1093/bioinformatics/btq524.
- McCoy CO, Gallagher A, Hoffman NG. nestly—a framework for running software with nested parameter choices and aggregating results. Bioinformatics (Oxford, England). 2013;29:387-8. doi:10.1093/bioinformatics/bts696.
- Buske FA, French HJ, Smith MA, Clark SJ, Bauer DC. NGSANE: a lightweight production informatics framework for high-throughput data analysis. Bioinformatics (Oxford, England). 2014;30:1471-2. doi:10.1093/bioinformatics/btu036.
- Yu L, Moretti C, Thrasher A, Emrich S, Judd K, Thain D. Harnessing Parallelism in Multicore Clusters with the All-Pairs, Wavefront, and Makeflow Abstractions. Journal of Cluster Computing. 2010;13:243-56.
- Köster J, Rahmann S. Snakemake-a scalable bioinformatics workflow engine. Bioinformatics (Oxford, England). 2012;28:2520-2. doi:10.1093/bioinformatics/bts480.
- Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010.
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology. 2013;14:R36.
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England). 2013;29:15-21.
- Liao Y, Smyth GK, Shi W. The Subread aligner: Fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research. 2013;41(10). doi:10.1093/nar/gkt214.
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009;26:139-40.
- Kallio MA, Tuimala JT, Hupponen T, Klemelä P. Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC Genomics. 2011;12. doi:10.1186/1471-2164-12-507.
- Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, et al. The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Research. 2013:W557-61. doi:10.1093/nar/gkt328.
- Edgar R, Barrett T. NCBI GEO standards and services for microarray data. Nature Biotechnology. 2006;24:1471-2.
- Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. NCBI GEO: archive for functional genomics data sets-update. Nucleic Acids Research. 2013;41(Database issue):D991-5.
- NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research. 2016;44:D7-D19.
- Hong EL, Sloan CA, Chan ET, Davidson JM, Malladi VS, Strattan JS, et al. Principles of metadata organization at the ENCODE data coordination center. Database: The Journal of Biological Databases and Curation. 2016;2016:baw001. doi:10.1093/database/baw001.
- Washington NL, Stinson EO, Perry MD, Ruzanov P, Contrino S, Smith R, Zha Z, Lyne R, Carr A, Lloyd P, Kephart E, McKay SJ, Micklem G, Stein LD, Lewis SE. The modENCODE Data Coordination Center: lessons in harvesting comprehensive experimental details. Database. 2011;2011:bar023.
- Sansone S-A, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W, Amaral-Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B, Clark T, Coleman L-A, Copeland J, Das S, de Daruvar A, de Matos P, Dix I, Edmunds S, Evelo CT, Forster MJ, Gaudet P, Gilbert J, Goble C, Griffin JL, Jacob D, et al. Toward interoperable bioscience data. Nature genetics. 2012;44:121-6.
- Türker C, Stolte E, Joho D, Schlapbach R. B-fabric: A data and application integration framework for life sciences research. In: Sarah Cohen-Boulakia VT, editor. Data integration in the life sciences, vol. 4544. Berlin Heidelberg: Springer; 2007. p. 37-47. doi:10.1007/978-3-540-73255-6_6 [Lecture Notes in Computer Science].
- Türker C, Akal F, Joho D, Schlapbach R. B-Fabric: An Open Source Life Sciences Data Management System. Berlin Heidelberg: Springer; 2009. p. 185-90. doi:10.1007/978-3-642-02279-1_13.
- Dijkstra EW. Selected Writings on Computing: A Personal Perspective. New York: Springer; 1982. p. 60-6.
- Buschmann F, Meunier R, Rohnert H, Sommerlad P, Stal M. Pattern-Oriented Software Architecture, a System of Patterns. Chichester, UK: Wiley Publishing; 1996.