ideal: an R/Bioconductor package for interactive differential expression analysis

Background RNA sequencing (RNA-seq) is an ever increasingly popular tool for transcriptome profiling. A key point to make the best use of the available data is to provide software tools that are easy to use but still provide flexibility and transparency in the adopted methods. Despite the availability of many packages focused on detecting differential expression, a method to streamline this type of bioinformatics analysis in a comprehensive, accessible, and reproducible way is lacking. Results We developed the ideal software package, which serves as a web application for interactive and reproducible RNA-seq analysis, while producing a wealth of visualizations to facilitate data interpretation. ideal is implemented in R using the Shiny framework, and is fully integrated with the existing core structures of the Bioconductor project. Users can perform the essential steps of the differential expression analysis workflow in an assisted way, and generate a broad spectrum of publication-ready outputs, including diagnostic and summary visualizations in each module, all the way down to functional analysis. ideal also offers the possibility to seamlessly generate a full HTML report for storing and sharing results together with code for reproducibility. Conclusion ideal is distributed as an R package in the Bioconductor project (http://bioconductor.org/packages/ideal/), and provides a solution for performing interactive and reproducible analyses of summarized RNA-seq expression data, empowering researchers with many different profiles (life scientists, clinicians, but also experienced bioinformaticians) to make the ideal use of the data at hand.

Differential expression analysis is a very commonly used workflow [4][5][6][7], whereby researchers seek to define the mechanisms for transcriptional regulation, enabled by the comparisons between, for example, different conditions, genotypes, tissues, cell types, or time points. The ultimate aim is to determine robust sets of genes that display changes in expression, and to contextualize them at the level of molecular pathways, in a way that can explain the biological systems under investigation and provide actionable insights in basic research and clinical settings [8].
Established end-to-end analysis procedures (such as [9][10][11]) are nowadays available, yet often bioinformatics analyses can be a challenging and time-consuming bottleneck, especially for users whose programming skills do not suffice to flexibly combine and customize the steps (and software modules) of a complete analysis pipeline.
In our previous work [21], we reviewed a selection of interfaces for RNA-seq analysis from the perspective of a life scientist, defining criteria that cover many essential aspects of every software/framework, including e.g. installation, usability, flexibility, hardware requirements, and reproducibility. Building upon these results, we subsequently developed a tool that satisfies a broad set of requirements for differential expression analysis and is presented in the following.
As a result of close collaborations with wet-lab life scientists and clinicians, we developed our proposal as an interactive Shiny [22] web based application in the ideal R/Bioconductor package, which guides the user through all operations in a complete differential expression analysis. ideal provides an integrated platform for extracting, visualizing, interpreting, and sharing RNA-seq datasets, similar to what our Bioconductor pcaExplorer package does for the fundamental step of exploratory data analysis [23].
The ideal package takes as input a count matrix and the experimental design information, for allowing to also analyze complex designs (such as multifactorial experimental setups), while making it easy to reproduce and share the analysis sessions, promoting effective collaboration between scientists with different skill sets, an open research culture [24], and the adherence to the FAIR Guiding Principles for scientific data [25]. Moreover, ideal delivers a wide range of information-rich visualizations, charts, and tables, both for diagnostic and downstream steps, which taken together form a comprehensive, transparent, and reproducible analysis of RNA-seq data.
ideal reprises and expands some design choices of pcaExplorer, with an improved documentation system based on tooltips and on the adoption of self-paced learning tours of the main functionality (with the rintrojs library [26]). State saving and automated HTML report generation via knitr and R Markdown, following a template bundled in the package itself (which can be edited by the experienced users to address specific questions), ensure code reproducibility, which has received increasing attention in recent years [27][28][29][30][31][32].
Similar to ideal, many of the existing tools accept count data in tabular format, and proceed to compute differentially expressed genes, accompanying this with visualizations (both focused on samples and on features) and sometimes downstream operations such as functional analysis for the identified subset. In most of the existing software, single genomic features of interest can be inspected, with some support for identifier conversion. The majority of these solutions are distributed as standalone web applications (commonly in R/Shiny, although some are written in Javascript); still, not all of these can be easily distributed as packages, or deployed seamlessly to private local instances. While still underrepresented, some of the tools allow the generation of an analysis product, which in many cases is based on a report in R Markdown [50], dynamically generated at runtime.
Overall, the existence of many such software packages highlights the need for a userfriendly framework to generate rich outputs for assisting analysis and interpretation, yet currently none of the existing proposals is offering the complete set of features we implemented in our work, with a full integration in the Bioconductor environment, and a seamless combination of interactivity and reproducibility.
The ideal package integrates and connects a number of R/Bioconductor packages, wrapping the current best practices in RNA-seq data analysis with a coherent user interface, and can deliver multiple types of outputs and visualizations to easily translate transcriptomic datasets into knowledge and insights. By leveraging the efficient core structures of Bioconductor, ideal allows flexible additional visualizations, as it is possible with custom scripts or with other GUI-based tools such as the iSEE package [51,52].
ideal is available at http://bioco nduct or.org/packa ges/ideal /, and the application can additionally be deployed as a standalone web-service, as we did for the publicly hosted version available at http://shiny .imbei .uni-mainz .de:3838/ideal , where the readers can explore the functionality of the app.

General design of ideal
ideal is written in the R programming language, wiring together the functionality of a number of widely used packages available from Bioconductor. ideal uses the framework of the DESeq2 package to generate the results for the Differential Expression (DE) step, as it was found to be among the best performing in many experimental settings for simple and complex eukaryotes [53,54]. Internally, this framework includes the estimation of size factors (with the median ratio method) and of the dispersion parameters, followed by the generalized linear model fitting and testing itself.
The web application and all its features are provided by a call to the ideal() function, which fully exploits the Shiny reactive programming paradigm to efficiently (re-) generate the rendered components and outputs upon detection of changes in the input widgets.
The layout of the user interface is built on the shinydashboard package [55], with a sidebar containing widgets for the general options, and the main panel structured in different tabs that mirror the different steps to undertake to perform a comprehensive differential expression analysis, from data setup to generating a full report. The task menu in the dashboard header contains buttons for state saving, either as binary RData files or as environments in the interactive workspace, accessible after closing the app.
Alongside features like tooltips, based on the bootstrap components in the shinyBS package [56], ideal uses collapsible elements containing text to quickly introduce the functionality of the diverse modules, and guided tours of the user interface via the rintrojs package [26], which provide means for learning-by-doing by inviting the user to perform actions that reflect typical use cases in each section. The Quick viewer widget in the sidebar keeps track of the essential objects, which are either provided upon launch, or computed at runtime, while valueBox elements (whose color turns from red to green once the corresponding object is available) above the main panel display a brief summary of each.
We invested particular attention in designing the application to guide the user through the different workflow steps (Fig. 1). This can be appreciated in the steps for the Data Setup panel, which appear dynamically once the required input and parameters are provided. Moreover, we used conditional panels to activate the functionality of each tab only if the underlying objects are available.
The base and ggplot2 [57] graphics systems are used to generate static visualizations, enabling interactions by brushing or clicking on them in the Shiny framework. Interactive heatmaps are generated with the d3heatmap [58] package, and tables are displayed as interactive objects for efficient navigation via the DT package [59].
We provide an R Markdown template for a complete DE analysis together with the package, and users can customize its contents by editing or adding chunks in the embedded editor (based on the shinyAce package [60]). Combining this object with the current status of the reactive widgets in the main tabs of the application, an HTML report is generated for preview at runtime, and can later be exported, shared with colleagues, or simply stored (Fig. 1, bottom section).
ideal has been tested on macOS, Linux, and Windows. It can be downloaded from the Bioconductor project page (http://bioco nduct or.org/packa ges/ideal /), and its development version can be found at https ://githu b.com/feder icoma rini/ideal /. Alternatively, ideal is also provided as a Bioconda recipe [61], simplifying the installation procedure in isolated software environments e.g. in combined use with Snakemake [62], with binaries available at https ://anaco nda.org/bioco nda/bioco nduct or-ideal .
Since ideal is normally installed on local systems, its speed and performance will vary depending on the hardware specifications available. In our experience, a typical modern laptop or workstation with at least 8/16 GB RAM is sufficient to run ideal on a variety of datasets. For example, in the analysis of the experimental dataset described in Additional File S2 (24 samples, with 4 treatments on 6 cell lines from different donors), the core functionality of DESeq2 required slightly less than 2 GB of RAM and less than one minute on a MacBook Pro with 2,9 GHz Intel Core i7 and 16 GB of memory, and required resources can be expected to scale approximately linearly with the increase of sample numbers -as measured with the profvis package [63]. The routines for functional annotation have a peak of allocated memory of ca. 4 GB, and take less than a minute to complete.
The desired depth of exploration after performing the backbone of the DE analysis is the main factor influencing the time required for completing a session with ideal, after familiarizing with its interface, e.g. by following the introductory tours on the demo dataset included.
The functionality of the ideal package is extensively described in the package vignette, regularly generated via the Bioconductor build system, and also embedded in the Welcome tab. Documentation for each function is provided, with examples rendered at the Github project page https ://feder icoma rini.githu b.io/ideal /, generated with the pkgdown package [64].

Typical usage workflow
During the typical usage session of ideal, users need to provide (or upload) two essential components: (1) a gene-level count matrix (countmatrix), a common  (2), the metadata table (expdesign) with the experimental variables for the samples of interest, as illustrated in the top panel of Fig. 1. ideal can accept any tabular text files, and uses simple heuristics to detect the delimiter used to separate the distinct fields; a preview on the uploaded files is shown in the collapsible boxes in the Step 1 of the Data Setup panel (Fig. 2a). A modal dialog informs the users about the formatting expected in the input files, with matched sample names and gene identifiers specified as column or row names. We strongly advise to perform a thorough exploratory data analysis on the input high-dimensional data, as this is a fundamental requirement in each rigorous analysis workflow. Users can refer to the pcaExplorer package for this purpose if an interactive approach is desired.
In the context of differential expression testing, the design argument has also to be specified, and this is normally a subset of the variables in the experimental metadata, which constitute the main source of information when submitting the data to a public repository such as the NCBI Gene Expression Omnibus (National Center for Biotechnology Information, https ://www.ncbi.nlm.nih.gov/geo/). All other parameters (the corresponding DESeqDataSet, DESeqResults, and a data frame containing matched identifiers for the features of interest) for the ideal() function can be also constructed a b c d , contains a text editor which displays the default comprehensive template provided with the package -which can additionally be edited by the user. In the framed content, a screenshot of the rendered report is included, e.g. with an annotated MA plot highlighting a set of specified genes manually on the command line and provided optionally, otherwise they will be computed at runtime. For demonstration purposes, we include a primary human airway smooth muscle cell lines dataset [65], which can be loaded in the Data Setup tab. For each module in the main application, ideal gives a text introduction to the typical operations, and then encourages the user to perform these in a guided manner by following the provided rintrojs tours, which can be started by clicking on a button. Descriptions of the user interface elements are anchored to the widgets themselves, and are highlighted in sequence while the interaction with them is enabled.
When the analysis session is terminated, binary RData objects and environments in the R session can store the exported reactive values. Additional analyses can be performed on the exported values, enabling e.g. alternative methods for functional enrichment, as illustrated in a section of the package vignette. While all result files and figures generated and displayed in the user interface can also be saved locally with few mouse clicks, the generation of a full interactive HTML report is the intended concluding step. This report is created by combining the values of reactive elements with the provided template, which can be extended by experienced users. Such a literate programming approach (conceived by [66] and perfected in the knitr package) is one of the preferred methods to ensure the technical reproducibility of computational analyses [67,68].
Additionally, users can continue exploring interactively the exported objects, if some representations are not included in ideal directly. A flexible interface to do so is represented by the iSEE Bioconductor package [51], which also fully tracks the code of the generated outputs, and we support this with a dedicated export function to a Summa-rizedExperiment object, with annotated rowData and colData slots filled with the results of the differential expression analysis.

Deploying ideal on a Shiny server
While we anticipate that the ideal package will typically be installed on local machines, it can also be deployed as a web application on a Shiny server, simplifying the workflow for users who want to analyze and explore their data without installing software. Deployment of an instance shared among lab members of the same research group is an exemplary use case; our proposal also supports protected instances behind institutional firewalls, e.g. if sensitive patient data is to be handled.
We describe the full procedure to set up ideal on a server and document the required steps in the GitHub repository https ://githu b.com/feder icoma rini/ideal _serve redit ion, which can be particularly useful for bioinformaticians or IT-system administrators. Following this approach, a publicly available instance has been created and is accessible at http://shiny .imbei .uni-mainz .de:3838/ideal for demonstration purposes, where users can either explore the airway dataset or upload their own data.

Results
The functionality of ideal is described in the next sections, and is illustrated in detail for the analysis of a human RNA-seq data of macrophage immune stimulation (published in [69]) in Additional file 2 (complete use case as HTML document structured like a vignette, with text, code, and output chunks).

Data input and overview
The setup for the data analysis is carried out in the Data Setup panel (Fig. 2a). To guide the user through the mandatory steps without an exceeding burden of interface elements, we designed this tab in a compact way, with boxes encapsulating related widgets, appearing consecutively once the upstream actions are completed. One of the fundamental data structures for the ideal app is a DESeqDataSet object, used in the workflow based on the DESeq2 package [9]. This is complemented by an optional annotation object, i.e. a simple data frame where different key types (e.g. ENTREZ, ENSEMBL, HGNC-based gene symbols) are matched to the identifiers for the features of interest. While this is not mandatory, it is recommended as some of the package functionality relies on the interconversion across such identifiers; ideal suggests the corresponding orgDb Bioconductor packages and makes it immediate to create such an object directly at runtime. Once the initial selections are finalized, the DESeq() command runs the necessary steps of the pipeline, displaying a textual summary and a mean-dispersion plot as diagnostic tools.
When dealing with large numbers of samples and more complex designs (entailing the computation of many coefficients), it is possible to take advantage of parallelized computation, as it is implemented for the DESeq2 package. ideal provides a slider to select the number of cores to use for running the main analysis, depending on the available resources (Step 3 in the Data Setup panel).
A first overview on the dataset, including a set of basic summary statistics on the expressed genes, as well as the (log transformed) normalized values can be retrieved in the Counts Overview tab, together with pairwise scatter plots of the values. Thresholds can be introduced to subset the original dataset by keeping only genes with robust expression levels, either based on the detection in at least one sample, or on the average normalized value.

Generating and exploring the results for differential expression analysis
The Extract Results tab provides the functionality to generate the other fundamental data structure, namely the DESeqResults object. After setting the FDR threshold in the sidebar, users are prompted to define the contrast of interest for their data, selecting one of the experimental factors included in the design. When the factor of interest has three or more levels available (e.g. the cell type in the airway demonstration dataset), the likelihood ratio test can be used instead of the Wald test, to allow for an ANOVAlike analysis across groups.
Further refinements to the results can be obtained by activating independent filtering [70], or selecting the more powerful Independent Hypothesis Weighting (IHW) framework [71], to ameliorate the multiple testing issue by incorporating an informative covariate, e.g. the mean gene expression [72]. Shrinkage of the effect sizes is also optionally performed on the log fold change estimates, to reflect the higher levels of uncertainty for lowly expressed genes. Interactive tables for the results are shown, with embedded links to the ENSEMBL browser and to the NCBI Gene portal to facilitate deeper exploration of shortlisted genes. Moreover, a number of diagnostic plots are generated, including histograms for unadjusted p-values, also using small multiples to stratify them on different mean expression value classes, a Schweder-Spjøtvoll plot [73], and a histogram of the estimated log fold changes.
More visualizations are included in the Summary Plots tab (Fig. 2b), where users can zoom in the MA plot (M, log 2 fold change vs A, average expression value) representation by brushing an area on the element. From the magnified subset, by clicking close to a selected gene, it is possible to obtain a gene expression boxplot (with the individual jittered observations superimposed), together with an info-box with details retrieved from the NCBI resource portal [74]. Heatmaps (both static and interactive) and volcano plots (log fold change vs log 10 of the p-value) deliver alternative views of the underlying result table, or interesting subsets of it.
Iterations oriented towards exploring a set of features of interest are made easier by the Gene Finder tab. Genes can be shortlisted on the fly, adding them from the sidebar selectize widget. For each feature of interest, a plot comparing the normalized values is displayed, and these are included in an annotated MA plot, where the selected subsets are highlighted on the plot and their values are shown in a corresponding table. Alternatively, a gene list can be uploaded directly as text file to obtain the same output, with the ease of providing entire sets of genes (e.g. a file with all cytokines, or a curated list of genes affected by a particular transcription factor) in one step.

Putting results into biological context
Many times it is challenging to make sense out of a carefully derived table of DE results, since it is not straightforward to identify the common biological themes that might be underlying the observed phenotypes. ideal offers different means to help researchers in meaningfully interpreting their RNA-seq data. The Functional Analysis panel offers three alternatives for gene set overrepresentation analysis, relying on topGO [75], goseq [76], or the goana() function in the limma package [7]; users can perform the enrichment tests on genes that are significantly differentially regulated, either split by direction of expression change, or combined in one list (Fig. 2c). Additionally, users can upload up to two custom lists of genes, which can be compared to the one derived from the result object, in order to detect significant overlaps among the sets of interest, which can be represented via Venn diagrams or UpSet plots [77].
The Gene Ontology (GO, [78]) terms enriched in each list can be interactively displayed, with links to the AmiGO database (http://amigo .geneo ntolo gy.org/), as well as heatmaps displaying the expression values for all the DE genes annotated to a particular signature.
Expanding on this functionality, ideal provides a Signatures Explorer panel, where a signature heatmap can be generated for any gene set provided in the gmt format, common to many sources of curated databases (MSigDB, WikiPathways). Conversion between identifier types is guided in the user interface, and so is the aspect of the final heatmap, where rows and columns can be clustered to better display existing patterns in the data, or transformations (such as mean centering or row standardization) help to bring the feature expression levels to a similar scale, for a better display in the final output.

Generating reproducible and transparent results
The focus in the development of ideal was on combining interactivity and reproducibility of the analysis. Therefore, we implemented the Report Editor as the toolset for enabling reproducible reports in the DE analysis step (Fig. 2d). The predefined template, embedded in the package, fetches the values of the reactive elements and the input widgets, thus capturing a snapshot of the ongoing session. Text, code, and results are all combined in an interactive HTML report, which can be previewed in the app, or subsequently exported.
This functionality is particularly appealing for less experienced users due to its automated simplicity, but experienced users can also take full benefit of it, by expanding the R Markdown document by adding or editing specific chunks of code.
State saving, activated by the buttons in the task menu in the header, stores the content of the current session into binary data objects or environments accessible from the global workspace.
As an additional feature, we leverage the flexibility of iSEE, the interactive Summa-rizedExperiment Explorer, another tool which fully supports intuitive and reproducible analyses, by assembling a serialized rds object that can be directly fed to the main iSEE() function for bespoke visualizations.

Discussion
The guiding principle for the development of our package ideal was the effective combination of usability and reproducibility, applied to one of the most widely adopted workflows in transcriptomics, i.e. the analysis of differentially expressed genes, followed by the downstream analyses based on functional enrichment among the subset of detected features.
Several software packages have been developed to operate on this tabular-like summarized expression data, or on formats which might derive from their results, and a comprehensive comparison of their features is presented in Additional file 1: Table S1. Notably, these tools differ by their set of included features (ranging from first exploration to downstream analysis steps), implementation (with R/Shiny, python, and JavaScript as main choices), format of distribution (packages, local web app, webserver), and ease of implementation in existing pipelines (e.g. by leveraging widely adopted class structures, or requiring and providing text files for portability across systems). The comparison with other tools is also available online (https ://feder icoma rini.githu b.io/ideal _suppl ement ), linked to a Google Sheet where the individual characteristics of each tool will be updated, in order to provide a tool for users who might be seeking advice on which solution to adapt for their needs (accessible at https ://docs.googl e.com/sprea dshee ts/d/167XV 0w18P 0FSld 1dt6o wN4C2 Esxl5 FU2QT o4D-wclz0 /edit?usp=shari ng).
Our proposal complements the existing pcaExplorer package, where the main exploratory data analysis steps are performed, and provides a platform for performing a complete differential expression analysis with the ease of interactivity, accompanied by a number of diagnostic plots, often overseen in other software tools. The combination of interactivity with reproducibility (Fig. 1, bottom panel) is an essential aspect to consider for generating robust and transparent analyses, substantiated by code which can also be used for didactic purposes, learning the best current practices with the state-of-the-art methods included in our package.
ideal fully supports widely used standard classes from Bioconductor, and thus allows seamless integration with many R packages for further downstream processing and within existing data analysis pipelines, while also benefiting from a thriving community of developers. ideal itself is part of Bioconductor, and thus is integrated into a build system that continuously checks all of its components and their interoperability, guaranteeing that the available set of features is correctly interfacing with the latest version of the package dependencies. Notably, the Bioconductor project enforces a number of best practices to enhance the usability of its components, with both internal and external documentation (for the individual functions and as complete tutorials, in form of vignettes), as well as providing unit test sets to ensure the software is working as expected: ideal adheres to these guidelines, which can be essential to define robust software [79] that can be adopted by a wide range of users.
A possible use case for deploying ideal is tightly related to data and result sharing. Distributing data in raw and processed format, together with a set of results, is becoming a practice that enables efficient data mining and can help ensure their reproducibility and reusability [25]. Stemming from a close collaboration with life and medical scientists, our tool allows researchers to share their work with other interested parties, starting from the operations during the collaboration phase, and continuing after publication, where broader audiences can effectively digest contents as they are presented from the authors.
The faster turnover in generating insights, thanks to the accessible interface and the multiple outputs, constitutes a significant advantage for reducing the time to new results, or alternatively for re-analyzing publicly available RNA-seq datasets. ideal provides a platform to facilitate discoveries in a standardized way, which at the same time improves the transparency and the reproducibility of the analyses. Indeed, one possible use case that we envision is the submission of a comprehensive notebook/report as Supplementary Material for a manuscript, so that the results are presented in a transparent manner, thus facilitating the contribution of reviewers, as well as the re-usability of analysis code. A rich technical description of parameters and software used would also greatly facilitate the writing of the "Material and Methods" sections, in a way that fully captures steps, parameters, code, and software versions.

Conclusion
The infrastructure provided by the ideal R/Bioconductor package delivers a web browser application that guarantees ease of use through interactivity and a dynamic user interface, together with reproducible research, for the essential step of differential expression investigation in RNA-seq analysis. The combination of these two features is a key factor for efficient, quick, and robust extraction of information, while leveraging the facilities available in the Bioconductor project in terms of classes and statistical methods.
The wealth of information that can be extracted while running the app may play a critical role when choosing the tools to adopt in a project. Still, to ensure the proper interpretation of the output results, the interaction of wet-lab scientists with collaborators with additional bioinformatics/biostatistics expertise is essential. The design choices for ideal aim at making this communication as robust and easy as possible, possibly defining this tool as the ideal way of approaching this step.
Following the criteria used in our previous overview on RNA-seq analysis interfaces [21], our package reaches out to the life/medical scientist, being simple to install and use, based on robust statistical methods, and offering multiple levels of documentation.
ideal allows scientists to easily take control of the analysis of RNA-seq data, while providing an accessible framework for reproducible research, which can be extended according to the user's needs.