Gaining insights from RNA-Seq data using iDEP 1

5 The analysis and interpretation of the RNA-Seq data can be time-consuming and challenging. We aim to 6 streamline the bioinformatic analyses of gene-level data by developing a user-friendly web application for 7 exploratory data analysis, differential expression, and pathway analysis. iDEP (integrated Differential 8 Expression and Pathway analysis) seamlessly connects 63 R/Bioconductor packages, 208 annotation 9 databases for plant and animal species, and 2 web services. The workflow can be reproduced by 10 downloading customized R code and related files. As demonstrated by two examples, iDEP (http://ge11 lab.org/idep/) democratizes access to bioinformatics resources and empowers biologists to easily gain 12 actionable insights from transcriptomic data. 13 14


1
Steven Xijin Ge 1,* and Eun Wo Son 1 2 1 Dept. of Mathematics and Statistics, South Dakota State University, Brookings, SD 57007 3 *To whom correspondence should be addressed. 4 5 The analysis and interpretation of the RNA-Seq data can be time-consuming and challenging. We aim to 6 streamline the bioinformatic analyses of gene-level data by developing a user-friendly web application for 7 exploratory data analysis, differential expression, and pathway analysis. iDEP (integrated Differential 8 Expression and Pathway analysis) seamlessly connects 63 R/Bioconductor packages, 208 annotation 9 databases for plant and animal species, and 2 web services. The workflow can be reproduced by 10 downloading customized R code and related files. As demonstrated by two examples, iDEP (http://ge-11 lab.org/idep/) democratizes access to bioinformatics resources and empowers biologists to easily gain 12 actionable insights from transcriptomic data. 13 14 Background 15 RNA sequencing (RNA-Seq) [1] is widely used for transcriptomic profiling. At increasingly reduced cost, 16 library construction and sequencing can often be easily carried out following standard protocols. For many 17 researchers, especially those without bioinformatics experience, the bottleneck to fully leverage the power 18 of the technique is how to translate expression profiles into actionable insights. A typical analytic workflow 19 involves many steps, each requiring several software packages. It can be time-consuming to learn, tune and 20 connect these packages correctly. Another hurdle is the scattered annotation databases which use diverse 21 types of gene IDs. To mitigate these issues, we aim to develop an application that can greatly reduce the 22 time and effort required for researchers to analyze RNA-Seq data. 23 RNA-Seq data analysis often starts with quality control, pre-processing, mapping and summarizing of 24 raw sequencing reads. We assume these steps were completed, using either the traditional Tuxedo Suite [2, 25 3] or alternatives such as the faster, alignment-free quantification methods [4,5]. These tools can be used 26 in stand-alone mood or through platforms like GenePattern [6] or web-based solutions such as Galaxy [7] 27 and iPlant/CyVerse [8]. 28 After read mapping, we often obtain a matrix of gene-level read counts or normalized expression levels 29 (Fragments Per Kilobase Million, or FPKM). For such tabular data, like DNA microarray data, R is a 30 powerful tool for visualization, exploratory data analysis (EDA), and statistical analysis. In addition, many 31 dedicated R and Bioconductor [9] packages have been developed to identify differentially expressed genes 32 (DEGs) and altered pathways. Some of the packages, such as DESeq2 [10], have been widely used. But it 33 can be time-consuming, or even out of reach for researchers without coding experience. 34 Several web applications have been developed to analyze summarized expression data (Table 1). START 35 App (Shiny Transcriptome Analysis Resource Tool) is a Shiny app that performs hierarchical clustering, 36 principal component analysis (PCA), gene-level boxplots, and differential gene expression [11]. Another 37 similar tool, Degust (http://degust.erc.monash.edu/) can perform differential expression analysis using 38 EdgeR [12] or limma-voom [13] and interactively plot the results. Other tools include Sleuth [14] and 39 ShinyNGS (https://github.com/pinin4fjords/shinyngs). Non-Shiny applications were also developed to 40 take advantage of the R code base. This includes DEIVA [15] and VisRseq [16]. Beyond differential 41 expression, several tools incorporate some capacity of pathway analysis. For quantified expression data, 42 ASAP (Automated Single-cell Analysis Pipeline) [17] can carry out normalization, filtering, clustering, and 43 enrichment analysis based on Gene Ontology (GO) [18] and KEGG [19] databases. With EXPath Tool [20], 44 users can perform pathway search, GO enrichment and co-expression analysis. The development of these 45 tools in the last few years facilitated the interpretation of RNA-Seq data . 46 In this study, we seek to develop a web application with substantially enhanced functionality with (1)  47  automatic gene ID conversion with broad coverage, (2) comprehensive gene annotation and pathway  48  database for both plant and animals, (3) several methods for in-depth EDA and pathway analysis, (4) access  49 to web services such as KEGG [19] and STRING-db [21] via application programming interface (API), 50 and (5) improved reproducibility by generating R scripts for stand-alone analysis. 51 Leveraging many existing R packages (see Figure 1) and the efficiency of the Shiny framework, we 52 developed iDEP (integrated Differential Expression and Pathway analysis) to enable users to easily 53 formulate new hypotheses from transcriptomic datasets. Taking advantage of the massive amount of gene  54 annotation information in Ensembl [22,23] and Ensembl Plants [24], as well as pathway databases 55 compiled by our group [25,26] and others [19,27,28], we were able to build a large database to support 56 in-depth pathway analyses. 57 We used iDEP to analyze two example datasets and generate all the figures and tables in this paper except 58 Table 1 and Figure 1. We first used it to extensively analyzed a simple RNA-Seq dataset involving small 59 interfering RNA (siRNA)-mediated Hoxa1 knockdown in human lung fibroblasts [3]. We identified the 60 down-regulation of cell-cycle genes, in agreement with previous studies. Our analyses also reveal the 61 possible roles of E2F1 and its target genes, including microRNAs, in blocking G1/S transition, and the 62 upregulation of genes related to cytokines, lysosome, and neuronal parts. The second dataset was derived 63 from an experiment with a factorial design to study the effect of ionizing radiation (IR) on mouse B cells 64 with and without functional TRP53 [29]. In addition to correctly identifying p53 pathway and the 65 enrichment of p53 target genes, we also found the p53-independent effects, including the regulation of 66 ribosome biogenesis and non-coding RNA metabolism, and activation of c-MYC. These examples show 67 that users can gain insights into both molecular pathways and gene regulatory mechanisms. 68

69
We developed an easy-to-use web application that encompasses many useful R and Bioconductor packages, 70 vast annotation databases, and related web services. The input is a gene-level expression matrix obtained 71 from RNA-seq, DNA microarray, or other platforms. Main functionalities include (1) pre-processing, (2) 72 EDA, (3) differential expression, and (4) pathway analysis and visualization. 73 We tried to develop an intuitive, graphical, and robust tool so that researchers without bioinformatics 74 experience can routinely and quickly translate expression data into novel hypotheses. We also wanted to 75 make can serve as a tool for preliminary analysis as it circumcises the need for many tedious tasks such as 78 converting gene IDs and downloading software packages and annotations. These users can also download 79 customized R scripts and related data files so that the analysis can be reproduced and extended. 80 iDEP Design 81 Figure 1 outlines the iDEP workflow. Expression matrix is first filtered, transformed and converted to 82 Ensemble gene IDs, which are used internally to identify genes. The pre-processed data is then used for 83 EDA, such as K-means clustering, hierarchical clustering, principal component analysis (PCA), t-SNE [30]. 84 Gene clusters identified by K-means are analyzed by enrichment analysis based on a large gene annotation 85 and pathway database. The identification of DEGs is done with either the limma [31] or DESeq2 [10] 86 packages. This is also followed by enrichment analysis on the DEGs. The fold-change values are then used 87 in pathway analysis using several methods. 88   To enable gene ID conversion, we downloaded all available gene ID mappings for 208 species from  89 Ensembl [22,23] (Table S1 in  Besides common ID like gene symbol, Entrez, Refseq, UCSC, UniGene, and Interpro IDs, the 67 kinds of 94 human gene IDs also include probe IDs for popular DNA microarray platforms, making it possible to re-95 analyze thousands of microarray datasets available at public repositories. 96 In the pre-processing stage, gene IDs are first compared to all gene IDs in the database for 208 organisms. 97 This enables automatic ID conversion and species identification. Genes expressed at very low levels are 98 removed and data are transformed as needed using one of several methods. iDEP enforces log-99 transformation when a highly skewed distribution is detected. This type of mechanisms can help avoid 100 issues in downstream analyses. The pre-processing stage also generates diagnostic and summary plots to 101 guide users to make their choices. 102 EDA enables the users to explore variations and patterns in the dataset as a whole [32]. The main methods 103 include hierarchical clustering with heatmap, k-means clustering, and PCA. Enrichment analysis of genes 104 derived from k-means clustering is conducted to gain insights into the functions of co-expressed genes. 105 Initial attempts of pathway analysis are carried out using the PCA loadings on each gene. This can tell us 106 the biological processes underlying each direction of expression change defined by the principal 107 components. 108 Differential expression analysis relies on two Bioconductor packages, limma [31] and DESeq2 [10]. 109 These two packages can meet the needs for most studies, including those involving multiple biological 110 samples and factorial design. Normalized expression data is analyzed using limma. Read counts data can 111 be analyzed using three methods, namely limma-trend [13], limma-voom [13,33], and DESeq2. Other 112 methods such as edgeR [12] may be incorporated in the future. 113 For simple study designs, iDEP runs differential gene expression analysis on all pairs of sample groups, 114 which are defined by parsing sample names. For complex studies, users can upload a file with experiment 115 design information and then build statistical models that can involve up to 6 factors. This also enables users 116 to control for batch effects or dealing with paired samples. 117 Fold-change values for all genes returned by limma or DESeq2 are used in pathway analysis using GSEA 118 [34], PAGE [35,36], GAGE [37] or ReactomePA [38]. Taking advantage of centralized annotation 119 databases for 208 species at Ensembl, Ensembl Plants, and Ensembl Metazoa, we downloaded not only GO 120 functional categorizations, but also promoter sequences for defining transcription factor (TF) binding motifs 121 for most species. Metabolic pathways were downloaded directly from KEGG [19] for 117 species (Table  122 S1 in Supp. File 1). Also, we incorporated Pathview package [39] to show gene expression on KEGG 123 pathway diagrams downloaded via API. In addition, we also included many species-specific pathway 124 knowledgebases, such as Reactome [28,38], GeneSetDB [40] and MSigDB [27] for human, GSKB for 125 mouse [25], and araPath for Arabidopsis [26]. These databases contain diverse types of gene sets, ranging 126 from TF and microRNA target genes to manually curated lists of published DEGs. For the human genome, 127 we collected 72,394 gene sets ( Table 2). Such comprehensive databases enable in-depth analysis of 128 expression data from different perspectives. 129 iDEP also enables users to retrieve protein-protein interaction (PPI) networks among top DEGs via an 130 API access to STRING [21]. These networks can be rendered both as static images and as richly annotated, 131 interactive graphs on the STRING website. The API access also provides enrichment analysis (GO, KEGG, 132 and protein domains) for 115 archaeal, 1678 bacterial, and 238 eukaryotic species, thus greatly expanding 133 the species coverage of iDEP. 134 Based on their chromosomal location obtained from Ensembl, we visualize fold-changes of genes on all 135 the chromosomes as an interactive graph based on Plotly. iDEP can also use the PREDA package [41] to 136 detect chromosomal regions overrepresented with up-or down-regulated genes. This is useful for studies 137 such as cancer that might involve chromosomal deletion or amplification. 138 For larger datasets, users can use bi-clustering algorithms to identify genes with correlated expression 139 among a subset of samples, using the 8 methods implemented in 3 Bioconductor packages biclust [42], 140 QUBIC [43], and runibic [44]. Gene co-expression networks can also be constructed with the WGCNA 141 package [45]. Enrichment analysis is routinely conducted on gene clusters derived from these methods. 142 To ensure the ease of deployment, we used docker containers to configure the web server. This technology 143 also enables us to easily clone the service many times to take advantage of the multiple cores. Load balanced 144 with Nginx, our web server can handle hundreds of concurrent users. The source code for iDEP and our 145 server configuration files are available at our Github repository (https://github.com/iDEP-SDSU/idep). 146 Use case 1: a simple experiment on Hoxa1 knockdown 147 We first analyzed a simple dataset related to Hoxa1 knockdown by siRNA in human lung fibroblasts [3]. 148 With 3 replicates for each of the two biological samples, this RNA-Seq dataset was used as example data 149 for Cuffdiff2 [3]. Available as Supplementary File 2, the read count data was previously used in a tutorial 150 for pathway analysis [46]. 151

Pre-processing and EDA 152
After uploading the read count data, iDEP correctly recognized Homo sapiens as the likely species with the 153 most genes IDs matched. After ID conversion and the default filter (0.5 counts per million, or CPM, in at 154 least one sample), 13,819 genes left. A bar plot of total read counts per library is generated (Figure 2A), 155 showing some small variation in library sizes. We chose the regularized log (rlog) transformation 156 implemented in the DESeq2 package, as it effectively reduces mean-dependent variance ( Figure S1 in Supp. 157 File 3). Distribution of the transformed data is shown in Figure 2B-C. Variation among replicates is small 158 ( Figure 2D). 159 iDEP also enables users to examine the expression level of one or more genes. Using "Hoxa" as a keyword, 160 we obtain Figure 3A, which shows that Hoxa1 expression level is reduced, but not abolished, in response 161 to siRNA-mediated knockdown of Hoxa1. Noticeably, expression of other family members, especially 162 Hoxa2, 4, and 5, also decrease. As these genes have similar mRNA sequences, it is unclear whether this is 163 caused by off-target effects of the siRNA or ambiguous mapping of RNA-Seq reads. Figure 3B, obtained 164 by using "E2F" as a keyword, shows the down-regulation of E2F1. 165 For clustering analysis, we rank genes by their standard deviation across all samples. The result of 166 hierarchical clustering of the top 1000 genes is shown in Figure 4A, which suggests that Hoxa1 knockdown 167 in lung fibroblast cells induce a drastic change in the expression of hundreds of genes. Variations among 168 technical replicates are minimal. These observations can also be confirmed by the correlation matrix (Supp. 169 Figure 2) and k-means clustering (Supp. Figure 3). 170 PCA plot using the first and second principal components is shown in Figure 4B. There is a clear 171 difference between the Hoxa1 knockdown and the control samples, along the first principal component that 172 explains 93% of the variance. Plot using multidimensional scaling (MDS), and t-SNE [30] also show a 173 similar distribution of the samples (Supp. Figure 4). We can choose to conduct pathway analysis using 174 PGSEA [35,36] by treating the loadings of the principal components as expression values. As suggested 175 by Supp. Figure 5, the first two components are related to cell cycle regulation. 176

Differentially expressed genes (DEGs) 177
With the DESeq2 method, we identified 907 upregulated and 1097 downregulated genes (see Supp. Table  178 3) using a threshold of false discovery rate (FDR) < 0.1 and fold-change >2. The volcano plot ( Figure 5A) 179 and the M-A plot ( Figure 5B) show that Hoxa1 knockdown leads to a massive transcriptomic response. 180 Plotly-based interactive versions of these plots are also available, where users can zoom in and mouse over 181 to see individual genes ( Figure 5C). A quick scan at the top genes ranked by the absolute values of fold-182 change (FCs) tells us that Hoxa1 knockdown induces cytokines (IL1B, IL24). 183 The up and down-regulated genes are then subjected to enrichment analysis based on the hypergeometric 184 distribution. Many different types of genes sets listed in Table 2 can be used to test various hypotheses. The 185 GO Biological Process terms enriched in DEGs are shown in Table 3. Upregulated genes are related to 186 regulation of cell proliferation, locomotion, and response to endogenous stimuli. This is perhaps the cell's 187 response to injected siRNAs. The downregulated genes are significantly enriched with cell cycle-related 188 genes (FDR < 2.6x10 -47 ). The effect of Hoxa1 knockdown on cell cycle was reported and experimentally 189 confirmed in the original study [3]. Cell cycle analysis revealed that loss of Hoxa1 leads to a block in G1 190 phase [3]. 191 As many GO terms are related or redundant (i.e., cell cycle and cell cycle process), we provide two plots 192 to summarize such correlation [47]. We first measure the distance among the terms by the percentage of 193 overlapped genes. Then this distance is used to construct a hierarchical tree ( Figure 6A) and a network of 194 GO terms ( Figure 6B). Both plots show that the enriched terms are distinct in the two gene lists. The down-195 regulated genes are overwhelmingly involved in cell cycle. The upregulated genes are related to 4 related 196 themes: cell proliferation, signaling, response to organic substance, and cell migration, possibly in reaction 197 to the injected siRNAs. 198 Choosing GO cellular component, we find that Hoxa1 knockdown suppresses genes that code for the 199 spindle, cytoskeleton and chromosomal parts (Supp. Figure 6). As Hoxa1 knockdown blocks G1/S transition 200 [3], a smaller number of cells are in the S (synthesis) phase, leading to the reduction of proteins related to 201 the spindle and chromosomal parts. Hoxa1 knockdown also induces genes related to plasma membrane, 202 neurons and synapses (Supp. Figure 6). This unexpected result is consistent with Hoxa1's role in neuronal 203 differentiation [48,49]. Polymorphisms of this gene are associated with cerebellar volume in humans [50]. 204 Hoxa1 may have different functions in various organs across developmental stages. 205 Choosing KEGG pathway, we confirm the overrepresentation of cell cycle-related genes in 206 downregulated genes (Supp. Figure 7). For up-regulated genes, we detect cytokine-cytokine receptor 207 interaction (CCRI) pathway (FDR<1.3×10 -10 ). "MSigDB.Curated" gene sets contain pathways from 208 various databases as well as published lists of DEGs from previous expression studies [27]. As shown in 209 Supp. Figure 8, the most significant are oligodendrocyte differentiation and several cell-cycle related gene 210 sets. As suggested by a meta-analysis of published gene lists [51], cell-cycle related expression signature is 211 frequently triggered by diverse cellular perturbations [52]. We uncovered similarity of our expression 212 signature with previously published ones. 213 Using the STRINGdb Bioconductor package, iDEP can analyze the lists of DEGs via the STRING API 214 [21] for enrichment analysis and the retrieval of PPI networks. The enrichment analysis led to similar results 215 (Supp. Table 4) to those obtained using the internal iDEP gene sets. In addition, STRING detected that the 216 Helix-loop-helix DNA-binding domain is overrepresented in proteins coded by the 907 upregulated genes, 217 while the Tubulin/FtsZ family, GTPase domain is enriched in the 1097 down-regulated genes (Supp . Table  218 S5). Figure 7 is the network of PPIs among the top 20 upregulated genes. The highly connected network 219 includes chemokine ligands 1 and 3 (CXCL1 and CXCL3), as well as interleukin 24 (IL24), suggesting the 220 immune response caused by injected siRNA. A link to an interactive version of this network on the STRING 221 website is also available for access to rich annotation information including protein structures and literature. 222 iDEP can also reveal gene regulatory mechanisms. Using the TF target gene sets in enrichment analyses, 223 we can obtain Table 4, which suggest that binding motifs for SP1 (FDR<9.80×10 -23 ) and E2F factors 224 (FDR<1.1×10 -16 ) are enriched in the promoters of down-regulated genes. E2F factors are regulators of cell 225 cycle [53]. E2F1 promotes G1/S transition [54] by regulation many genes, including itself. SP1 binding 226 sites were identified in cell-cycle related genes such as Cyclin D1 (CCD1) [55]. SP1 is a G1 phase specific 227 TF [56]. The interaction of E2F1 and SP1 proteins mediate cell cycle regulation [57]. The upregulated 228 genes are enriched with target genes of NF-κB (FDR<4.9×10 -9 ) and FOXO3(FDR<4.9×10 -9 ), known to be 229 regulators of the immune response [58,59]. 230 The Motif gene sets from MSigDB are derived from [60] and contain sets of genes sharing TF binding 231 motifs in gene promoters and microRNA target motifs in 3' untranslated regions (UTRs). Using this gene 232 set, we again detect the enrichment of E2F motifs in promoters of downregulated genes (Supp .Table S10). 233 We also detected overrepresentation of a "GCACTTT" motif in 3' UTRs of upregulated genes. This motif 234 is targeted by several microRNAs, namely miR-17-5P, miR-20a, miR-106a. Cloonan et al. showed that 235 miR-17-5P targets more than 20 genes involved in the G1/S transition [61]. Trompeter et al. provided 236 evidence that miR-17, miR-20a, and miR-106b enhance the activities of E2F factors to influence G1/S 237 transition [62]. miR-106b resides in the intron of Mcm7 along the sense direction. Mcm7 is an E2F1 target 238 gene that is also downregulated by Hoxa1 knockdown (see Fig. 8A). Petrocca et al. showed that E2F1 239 regulates miR-106b, which can conversely control E2F1 expression [63]. Thus, it is possible that Hoxa1 240 knockdown reduces E2F1 expression (see Figure 3B) and its target genes, including Mcm7, which hosts 241 miR-106b. We can speculate that downregulated miR-106b, in turn, causes the increases in the expression 242 of its target genes. Leveraging the comprehensive pathway databases, iDEP enables researchers to develop 243 new hypotheses that could be further investigated. 244 For many species, predicted TF target genes are not available. As a solution, we downloaded promoter 245 300bp and 600bp sequences from ENSEMBL and scanned them with a large collection of TF binding 246 motifs [64]. As shown in Table 5, the promoters of DEGs are overrepresented with many G-rich motifs 247 bound by E2F and other factors such as TCFL5 and SP2. We compared the best possible scores for each 248 TF and promoter pair and run t-tests to compare these scores. Further study is needed to validate this 249 approach. 250 For human (Table 2), mouse [25] and Arabidopsis [26], we also include predicted target genes for many 251 miRNAs from multiple sources. Using these gene sets, we detected significant enrichment (Table 6) of  252 miRNA-193b, miR-192, and miR-215 target genes among the down-regulated genes. miR-193b was shown 253 to suppress cell proliferation and down-regulate CCND1 [65]. Proposed as biomarkers of several cancers, 254 miR-192 also inhibit proliferation and can cause cell cycle arrest when overexpressed [66]. miR-215 shares 255 many target genes with miR-192 and are also downregulated in cancer tissues [67]. These miRNAs may 256 play a role in the regulation of cell cycle upon Hoxa1 knockdown. 257

Pathway analysis 258
Instead of using selected DEGs that are sensitive to arbitrary cutoffs, pathway analysis can use fold-change 259 values of all genes to identify coherently altered pathways. We used the GAGE [37] as method and KEGG 260 as gene sets. The results (Supp . Table S6) is similar to those from a previous analysis by Turner in an online 261 tutorial [46] and also agrees with our enrichment analysis based on DEGs. For each of the significant KEGG 262 pathways, we can view the fold-changes of related genes on a pathway diagram using the Pathview 263 Bioconductor package [39]. Many cell cycle genes are marked as green in Figure 8, indicating reduced 264 expression in Hoxa1-knockdown samples. We also detected upregulation of genes related to CCRI, arthritis, 265 and lysosome. Many CCR related genes are up-regulated ( Figure 9). Not detected using DEGs, lysosome-266 related genes are mostly upregulated (Supp. Fig. 9). Injected siRNAs might be degraded in the lysosome. 267 By changing the gene sets database for pathway analysis, we can gain further insights. Using 268 MSigDB.Motif gene sets, we can verify the enrichment of E2F binding motifs (Supp. Table 7). For non-269 KEGG gene sets, heatmaps are created to show the expression of genes in significant gene sets. Figure 10A  270 shows part of such a plot, highlighting genes that share the "SGCGSSAAA" motif bound by E2F1. Note 271 that E2F1 gene itself is included in the figure, as it binds to its own promoter and forms a positive feedback 272 loop [54]. The downloaded expression data indicate that E2F1 is downregulated by more than 3-fold in 273 Hoxa1 knockdown samples (see Figure 3B). Upon Hoxa1 knockdown, downregulation of E2F1 and 274 downstream genes, including microRNAs, may be part of the transcription program that blocks G1/S 275 transition. 276 Users can use many combinations of methods and gene sets to conduct pathway analysis. For example, 277 using PGSEA on KEGG pathways yielded Figure 10B, again confirming previous results on suppressed 278 cell cycle genes and induced lysosome and CCRI related genes. Using the MSigDB.Motif gene sets, we 279 can also confirm the E2F1 binding motifs ( Figure 10C). The most highly activated gene sets are related to 280 miR-17-5p, miR-20a, miR106a,b and so on ( Figure 10C), which agrees with enrichment analysis using just 281 gene lists. 282 Some pathways can be attenuated by upregulating some of its associated genes while downregulating 283 others. To detect such pathways, we can use the absolute values of fold changes in pathway analysis. This 284 is achieved by checking the box labeled "Use absolute values of fold changes for GSEA and GAGE." 285 Instead of detecting up or down-regulated pathways, the results show which pathways are more regulated. 286 As shown in Supp. Table S8, while the expression of ribosome related genes is less variable upon Hoxa1  287 knockdown, genes related to CCRI are highly regulated. 288 The expression of neighboring genes can be correlated, due to many mechanisms, including super-289 enhancers [68], 3D chromatin structure [69], or genomic gain or loss in cancer. To help users detect such 290 correlation, we use ggplot2 [70] and Plotly to interactively visualize fold-changes on all the chromosomes 291 ( Figure 11A). Based on regression analysis, the PREDA package [41] can detect statistically significant 292 chromosomal regions with coherent expression change among neighboring genes. Figure 11B shows many 293 such regions in response to Hoxa1 knockdown. Detailed information obtained from downloaded files (Supp. 294 Table S9) suggests, for example, a 4.3 Mbps region on Chr.1q31 contains 6 upregulated genes (PRG4, TPR, 295 C1orf27, PTGS2, PLA2G4A, and BRINP3). 296 To further validate our parameterization of PREDA, we analyzed DNA microarray data (Supp. File 4) of 297 thymus tissues from patients with Down syndrome [71]. We detected large, upregulated regions on 298 chromosome 21 (Supp. Fig. 10), as expected. Even though PREDA analysis is slow and has low-resolution 299 due to the use of gene-level expression score, it might be useful in cancer studies where localized expression 300 change on the chromosome can happen. 301 To improve reproducibility, iDEP generates custom R code and R Markdown code based on user data 302 and choices of parameters (Supp. Files 5-7). Users with some R coding experience should be able to re-run 303 most analyses by downloading the related annotation and gene sets used by iDEP. An example is shown 304 here [72]. 305 306 Tonelli et al. [29] used RNA-Seq to study the effect of whole-body ionizing radiation (IR) on the mouse 307 with or without p53. B cells and non-B cells were isolated from mouse spleen after treatment. We analyzed 308 the B cell data involving two genotypes (p53 wildtype and p53 null) with mock or IR treatment, a typical 309 2x2 factorial design. The read count and experimental design files are available as Supp. Files 8 and 9. A 310 converted, filtered version of this dataset is incorporated into iDEP as a demo data. 311

Use case 2: factorial design on p53's role in genotoxic stress
With this dataset, we demonstrate how users can easily generate hypothesis on molecular pathways and 312 gene regulatory mechanisms through interaction with their data at three main steps: enrichment analysis of 313 k-means clusters, enrichment analysis of the lists of DEGs, and pathway analysis using fold-changes values 314 of all genes. 315 Pre-process and EDA of p53 dataset 316 We noticed reduced total reads for wildtype samples treated with IR ( Figure 12A). While this may be caused 317 by biology, but biased sequencing depth presents a confounding factor, that has not been discussed widely. 318 To quantify such biases, iDEP routinely performs ANOVA analysis of total read counts across sample 319 groups. For this example, uneven read counts are detected (P=0.047) and a warning is produced. 320 EDA shows that IR treatment led to the changes in thousands of genes. Based on the distribution of 321 variances ( Figure 12B), we choose the top 2500 genes for clustering analysis. Hierarchical clustering 322 ( Figure 13) shows the substantial differences between treated and untreated samples. It also shows the 323 patterns of different groups of genes and the variations among some replicates of treated wild-type cells 324 (wt_IR). 325 We then used k-means clustering to divide the top 2500 genes into groups. Based on the within-group 326 sum of squares plot (Supp. Figure 11) as a reference, we chose a slightly larger k=9. Figure 14 shows the 327 gene clusters and the enriched GO terms. Details are available in Supp. Tables S10 and S11. Genes in 328 clusters B and I show similar responses to IR across genotypes. Strongly enriched in genes related to the 329 immune system (FDR < 3.65×10 -18 ), cluster B are downregulated by IR in both cell types. The immune-330 suppressive effects of radiation [73] are clearly p53-independent. Induced by IR in both wildtype and Trp53 -331 /cells, cluster I genes are enriched in ribosome biogenesis but with much lower level of significance (FDR 332 < 2.25×10 -5 ). 333 On the other hand, genes in clusters A, C, and D are specific to the wild-type cells. Cluster A contains 13 334 genes that code for histone proteins and are involved in nucleosome assembly (FDR<1.66×10 -11 ). Genes in 335 Clusters C and D are induced by IR only in B cells with p53, but the former is more strongly upregulated. 336 As expected, cluster C is related to the p53 pathway (FDR<1.38×10 -10 ) and apoptosis (FDR<3.59×10 Identifying DEGs in the p53 dataset 348 To identify genes induced by IR in both cell types, users can use pair-wise comparisons among the 4 sample 349 groups. Alternatively, we can construct linear models through the GUI. Here we use the following model: 350 Expression ~ p53 + Treatment + p53:Treatment, 351 where the last term represents the interaction between genotype and treatment, capturing the additional 352 effects of p53 in IR response. It is important to set the reference levels for factors in a model. Here we set 353 the null (Trp53 -/-) as a reference level for the factor "p53" and mock for the factor "Treatment". 354 With FDR<0.01 and fold-change > 2 as cutoffs, we used DESeq2 to identify DEGs ( Figure 15A and B). 355 Without treatment, the two cell types have similar transcription profiles, with few DEGs. But even in Trp53 -356 /cells, IR caused the upregulation of 1570 genes, 469 of which is also upregulated in p53 wildtype B cells 357 (see Venn diagram in Figure 15C). 358 To further understand the molecular pathways, we perform enrichment analysis of the 10 gene lists (Supp.  359  Table S12) associated with 5 comparisons. We focus on two comparisons (1) "IR-mock" representing the 360 baseline response of IR in mutant cells without p53, and (2) "I:p53_wt-Treatment_IR", the interaction term 361 capturing the additional effect of p53 compared to the baseline response. 362 For the first comparison, Figure 16 shows IR induced DEGs in mutant cells. The 1570 upregulated genes 363 are related to non-coding RNA (ncRNA) metabolic process (FDR<1.33×10 -79 ), ribosome biogenesis 364 (FDR<2.54×10 -67 ), and translation (FDR<3.03×10 -32 ). This enrichment profile is similar to cluster H 365 derived from the k-Means clustering, as the two lists capture the same group of genes. The upregulated 366 genes are surprisingly coherent in function. For example, 219 (14%) can be found in the nucleus, 286 (18%) 367 is related to the mitochondrion, and, most significantly, 407 (26%) is RNA-binding (FDR<3.54×10 -138 ). 368 The 1570 upregulated genes contain 7 MYC target genes (FDR<4.22×10 -7 ), consistent with the fact that 369 MYC is a direct regulator of ribosome biogenesis [74]. This agrees with reports of the involvement of MYC 370 in radiation treatment [75,76], suggesting MYC may trigger proliferation pathways upon genotoxic stress, 371 in the absence of p53. 372 Genes downregulated by IR in Trp53 -/-B cells are related to immune system (FDR < 4.22×10 -8 ), GTPase 373 activity (FDR < 3.75×10 -6 ), and actin cytoskeleton (FDR < 2.06×10 -5 ). As shown in Supp. Table S13, we  374 can also detect the enrichment of the target genes of miR-124 (FDR < 4.56×10 -12 ), an important modulator 375 of immunity [77]. Others associated miRNAs, including miR-6931-5p, Mir-4321, and miR-576-5p, may 376 also be involved. 377 For the second comparison, the expression profiles of DEGs associated with the interaction term is shown 378 in Figure 17. This is the p53 mediated IR response, compared to the baseline response without p53. The 379 676 genes that are upregulated in wild-type B cells following IR, but not in Trp53 -/-B cells. As expected, 380 these genes are enriched in p53-mediated response to DNA damage (FDR < 1.43×10 -6 ), and apoptosis (FDR 381 < 9.72×10 -6 ). As shown in Supp. Table S13, these genes are overrepresented with 25 target genes of p53 382 (FDR < 1.34×10 -13 ) and 76 target genes of miR-92a (FDR < 2.79×10 -11 ). Part of the miR-17/92 cluster, 383 miR-92a is related to tumorigenesis and is regulated by p53 [78,79]. Another miRNA with overrepresented 384 target genes is miR-504 (FDR < 3.25×10 -8 ), which has been shown to binds to 3' UTR of Trp53 and 385 negatively regulate its expression [80]. Located in the introns of the fibroblast growth factor 13 (FGF13) 386 gene, miR-504 is transcriptionally suppressed by p53, forming a negative feedback loop [81]. Following 387 radiation, the expression of both miR-92a and miR-504 in wild-type B cells may be reduced, leading to the 388 upregulation of their target genes. Further study is needed to verify this hypothesis. 389 As shown in Figure 17, the 584 genes downregulated according to the interaction term are those that are 390 induced in the Trp53 -/-B cells, but not in wild-type B cells. These genes are overrepresented with ncRNA 391 processing, ribosome biogenesis, cell cycle, and RNA transport (Supp . Table S14). Most (411) of the 584 392 genes are included in the genes upregulated by IR in Trp53 -/-B cells (Supp. Figure 12). MYC target genes 393 are also downregulated by p53 upon IR. In wildtype B cells, p53 suppresses the MYC oncogenic pathway 394 compared to Trp53 -/-B cells. The most significant shared TF binding motif is E2F1 (FDR < 7.73×10 -11 ). 395 This agrees with the role of p53 in cell cycle arrest through p21-mediated control of E2F factors [82]. 396 Pathway analysis of p53 data 397 Many of the above observations can be confirmed by using pathway analysis (Supp. samples. Figure 18 clearly shows that p53 signaling pathway, apoptosis, and positive regulation of cell 401 cycle arrest are uniquely activated by IR in wild-type B cells. This is again confirmed by TF target genes 402 ( Figure 19). In addition, the p53-independent upregulation of MYC target genes can also be observed in 403 Figure 19. Several ETS transcription factors, including SFPI1, SPI1, and ETS1, are suppressed by IR in 404 both cell types. These factors may underlie the suppression of immune response as suggested [83]. 405 Applying PGSEA on miRNA target genes highlights miRNA-30a (Supp. Figure 15), whose target genes 406 are specifically activated by IR in wild-type B cells. miRNA-30a was shown to be involved in response to 407 IR [84] and mutually regulate p53 [85]. Thus, the complex p53 signaling pathways are unveiled with 408 remarkable accuracy. 409 The upregulated p53 target genes can be seen in the KEGG pathway diagram (Figure 20). This pathway 410 map shows multifaceted roles of p53 in the regulation of apoptosis, cell cycle, DNA damage repair, and 411 growth arrest. Many of these functions were re-discovered in our analyses above. This shows the power of 412 comprehensive pathway databases coupled with broad analytic functionalities accessible via an intuitive 413 user interface. Without iDEP, it can take days or weeks to write code and collect data to conduct all the 414 analyses above. With iDEP, biologists can complete such analyses in as little as 20 minutes. 415 416 By integrating many Bioconductor packages with comprehensive annotation databases, iDEP enables users 417

Discussions and conclusions
to conduct in-depth bioinformatics analysis of transcriptomic data through a GUI. Taking advantage of the 418 Shiny platform, we were able to pack many useful functionalities into iDEP, including high-quality graphics 419 based on ggplot2 and interactive plots using Plotly. Compared with traditional web applications, Shiny has 420 its drawbacks and limitations. The interface is not as flexible as those developed using JavaScript. 421 Nevertheless, we believe an integrated web application like iDEP is a valuable tool to both bench scientists 422 and bioinformaticians. 423 As an example, we extensively analyzed an RNA-Seq dataset involving Hoxa1 knockdown by siRNA in 424 lung fibroblasts, and identified the down-regulation of cell-cycle genes, in agreement with previous 425 analyses and experimental confirmation. Our analyses also show E2F and SP1 binding motifs are enriched 426 in the promoters of downregulated genes, mediating the cell cycle arrest. Furthermore, we also find 427 evidence that microRNAs (miR-17-5P, miR-20a, miR-106a, miR-192, miRNA-193b, and miR-215) might 428 work together with E2F factors to block the G1/S transition in response to reduced Hoxa1 expression. 429 Interestingly, miR-106a is located in the intron of Mcm7, an E2F1 target gene. DEGs are also enriched with 430 genes related to neuron parts, synapse, as well as neurodegenerative diseases. This is consistent with reports 431 of Hoxa1's role in neuron differentiation [48][49][50]. Hoxa1 knockdown induces expression of genes 432 associated with the cytokine-cytokine interaction, lysosome, and cell migration, probably in response to the 433 injected siRNAs. These genes are overrepresented with target genes of NF-κB, known to be involved in 434 immune response. By combining both annotation dataset and analytic functionality, iDEP help biologists 435 to quickly analyze their data from different perspectives. 436 In the second example, our analysis shows that in B cell without p53, radiation treatment upregulates 437 MYC oncogenic pathway, triggering downstream genes with highly coherent functions such as cell 438 proliferation, ribosome biogenesis, and ncRNA metabolism. Enriched with target genes of miR-124, and 439 ETS domain transcription factors, genes downregulated by IR in p53 null B cells are associated with 440 immune response, GTPase activity and actin cytoskeleton. In wildtype B cells, a p53-dependent 441 transcriptional response to IR is evidently related to p53-mediated apoptosis and DNA repair, as expected. 442 The target genes of MYC and E2F1 are suppressed by p53, leading to growth and cell cycle arrest. iDEP 443 helps unveil the multifaceted functions of p53, and also highlight the potential involvement of several 444 miRNAs (miR-92a, miR-504, and miR-30a). 445 Users should be cautious when interpreting results from pathway analysis, which can be obtained through 446 the many combinations of methods and gene set databases. The biomedical literature is large and 447 heterogeneous [86], making it easy to rationalize and make a story out of any gene. True pathways, like the 448 effect of Hoxa1 knockdown on cell cycle, should be robustly identified across different methods and 449 databases. Also, as demonstrated in the two examples, for each enrichment or pathway analysis, we tried 450 to focus on the most significant gene sets. 451 Besides RNA-Seq and DNA microarray data, users can also use iDEP to analyze fold-change and FDR 452 values calculated by other methods such as cuffdiff [87]. For unannotated genomes, iDEP can only be used 453 for EDA and differential expression analysis. For single-cell RNA-Seq data [88], only smaller, pre-454 processed, imputed datasets with hundreds of cells can be analyzed, as iDEP is mostly designed to handle 455 transcriptomic data derived from bulk tissues. 456 In addition to updating the annotation database derived from Ensembl every year, we plan to continue to 457 compile pathway databases for model organisms, similar to MSigDB and GSKB. For currently unsupported 458 species, we will consider ways to incorporate user-submitted gene annotation. Based on user request and 459 feedback, we will also add more functions by including additional Bioconductor packages. 460

461
The author thanks Brian Moore and Kevin Brandt for technical support of the web server. iDEP is 462 strengthened by constructive criticisms from a previous reviewer, and many suggestions from users. 463

464
This work was partially supported by National Institutes of Health (GM083226), National Science 465 Foundation/EPSCoR (IIA-1355423) and by the State of South Dakota. 466