ADAGE signature analysis: differential expression analysis with data-defined gene sets

Background Gene set enrichment analysis and overrepresentation analyses are commonly used methods to determine the biological processes affected by a differential expression experiment. This approach requires biologically relevant gene sets, which are currently curated manually, limiting their availability and accuracy in many organisms without extensively curated resources. New feature learning approaches can now be paired with existing data collections to directly extract functional gene sets from big data. Results Here we introduce a method to identify perturbed processes. In contrast with methods that use curated gene sets, this approach uses signatures extracted from public expression data. We first extract expression signatures from public data using ADAGE, a neural network-based feature extraction approach. We next identify signatures that are differentially active under a given treatment. Our results demonstrate that these signatures represent biological processes that are perturbed by the experiment. Because these signatures are directly learned from data without supervision, they can identify uncurated or novel biological processes. We implemented ADAGE signature analysis for the bacterial pathogen Pseudomonas aeruginosa. For the convenience of different user groups, we implemented both an R package (ADAGEpath) and a web server (http://adage.greenelab.com) to run these analyses. Both are open-source to allow easy expansion to other organisms or signature generation methods. We applied ADAGE signature analysis to an example dataset in which wild-type and ∆anr mutant cells were grown as biofilms on the Cystic Fibrosis genotype bronchial epithelial cells. We mapped active signatures in the dataset to KEGG pathways and compared with pathways identified using GSEA. The two approaches generally return consistent results; however, ADAGE signature analysis also identified a signature that revealed the molecularly supported link between the MexT regulon and Anr. Conclusions We designed ADAGE signature analysis to perform gene set analysis using data-defined functional gene signatures. This approach addresses an important gap for biologists studying non-traditional model organisms and those without extensive curated resources available. We built both an R package and web server to provide ADAGE signature analysis to the community. Electronic supplementary material The online version of this article (10.1186/s12859-017-1905-4) contains supplementary material, which is available to authorized users.


Background
High-throughput genome-scale measurements are now widely used because they can provide a global view of a biological system. Typical experiments involve a control and some sort of treatment, and the typical output is a list of genes with expression levels that were significantly altered. In addition to examining genes of interest individually, researchers often summarize results with gene set overrepresentation analysis (also called pathway analysis) to infer the biological basis of the gene list. These analyses aim to link groups of perturbed genes by their biological themes and help researchers understand the effect of an experiment on biological pathways.
Two primary components comprise gene set analyses: a testing algorithm and pre-defined sets of biologically themed genes. While the first part has been extensively explored [1][2][3], the second part has not drawn much attention because the creation and maintenance of gene sets remains a largely manual process requiring substantial curator effort. Currently, gene sets are primarily contributed by consortia of curators, such as the GO consortium [4,5]. Manual annotation ensures the quality of the gene sets but is slow, can be tedious, and leads to gene sets with certain biases [6]. Furthermore, while a small set of primary model organisms has received substantial curator effort, other organisms remain sparsely annotated. Accurately transferring annotations across organisms using computational prediction algorithms remains challenging, particularly for biological processes [7]. Due to the limited availability and sparse coverage of gene sets, the potential of gene set analysis remains limited for most non-traditional model organisms.
In contrast to the paucity of carefully curated gene sets specific to these non-traditional models, the amount of genome-wide gene expression data has grown rapidly, especially for microbes which have a relatively small transcriptome and are inexpensive to assay [8]. For single-cell organisms, a complete compendium of public data ideally captures expression under numerous conditions. We may expect these compendia to extensively characterize many of the organism's transcriptional regulatory processes and to be well-suited targets for the extraction of pathway-like signatures. There are resources, such as MSigDB [9], that have been generated by the computational analysis of curated and labeled experiments. But such labeled compendia are often also unavailable in the same settings where curated gene sets are limited: MSigDB only contains gene sets for Homo sapiens, Macaca mulatta, Mouse, Mus musculus, Rattus norvegicus, and Danio rerio.
We previously developed ADAGE, an algorithm that extracts meaningful gene sets from unlabeled genomewide gene expression compendia [10]. ADAGE models are unsupervised neural network models of large publicly available gene expression compendia. Specifically, ADAGE models are denoising autoencoder neural networks [11,12], which means that they are trained to take input where noise has been intentionally added and remove the added noise. This process of noise addition and removal has been shown to make robust unsupervised models that capture the essential features of the underlying data [12]. We built these models from gene expression data and found that the genes that most influence each node formed gene sets that resembled human-annotated biological processes and pathways, likely because these processes and pathways are often coexpressed across large gene expression compendia [10,13]. Though the model is unsupervised, the highest weight genes for each neural network node can often be interpreted either in the context of curated gene sets or by biologists who recognize biological commonalities among genes. We have termed each high-weight set an ADAGE signature. Additional approaches can be layered on top of this base strategy: for example we developed eADAGE, which summarizes multiple ADAGE models into an ensemble model, which covered more biological pathways more precisely [13]. Subsequent interpretation of the eADAGE model revealed that, beyond curated pathway sets, eADAGE also extracted signatures that grouped genes that match known but uncurated pathways suggesting that some others may capture biological processes that have not yet been fully described.
To fully leverage signatures built by eADAGE or other robust feature construction approaches, we introduce an ADAGE signature analysis pipeline. ADAGE signature analysis aims to identify one or more signatures that respond to an experimental treatment. As with gene set analyses, these signatures represent biological processes that may be perturbed by the treatment. The approach is similar to traditional gene set analysis but replaces human-annotated gene sets with ADAGE-learned signatures. ADAGE signature analysis complements pathway-style analysis in any organism by providing an unsupervised perspective, and is usable for nontraditional model organisms or other organisms for which curated pathways are unavailable. Here we demonstrate ADAGE signature analysis in the bacterial pathogen Pseudomonas aeruginosa. We chose P. aeruginosa as our model organism because it has sufficient public gene expression data to construct a model and a dedicated research community. Though sparsely curated in the recent past, its pathway annotations have been growing rapidly due to a community annotation initiative [14]. This allows us to validate the biological relevance of gene signatures learned by ADAGE, while also demonstrating its ability to identify as yet unannotated biological processes. To facilitate the use of ADAGE signature analysis, we developed both an R package for users with bioinformatics background and an easy-touse web server intended for use by bench biologists.

ADAGE signature analysis workflow
ADAGE signature analysis has three major steps: data preparation, active signature detection, and signature interpretation (Fig. 1). In addition to the input dataset to be analyzed, ADAGE signature analysis also requires an ADAGE model and the gene expression compendium of an organism from which the model was built. The P. aeruginosa compendium and (e)ADAGE models can be built following instructions in [10,13].

Data preparation
Gene expression data must be normalized to be compatible with the ADAGE model. Raw microarray data measured on the same chip platform as the compendium are analyzed with Robust Multi-array Average (RMA) [15]. RMA includes background correction, quantile normalization, and probe summarization; however, quantile normalization is performed against the quantile distribution of the compendium so that the resulting expression values are comparable with the compendium. Other types of gene expression data must first be processed into gene-level expression values. Gene identifiers used in the input data are mapped to the gene identifiers used in the compendium. We next impute the expression of missing genes using k-nearest neighbors -the neighbors are computed based on similarity in the compendium. For processed microarray data, we apply quantile normalization using the compendium's quantiles. For RNA-seq data, expression values are normalized to the compendium via TDM [16]. The last step in data preparation for all types of input data is a zero-one linear transformation using the compendium as reference. Measurements outside the range observed in the Fig. 1 ADAGE signature analysis workflow. An input dataset is first processed to be compatible with a provided ADAGE model trained on the provided expression compendium. Raw microarray measurements are transformed by RMA, while processed data first have missing values imputed and then are normalized to the compendium. The second step is detecting active signatures in the dataset. Signature activity for each sample in the dataset is computed. Statistical test on the signature activity is used to identify signatures respond to an experimental treatment. Redundant signatures are further removed. The last step is interpreting the biological meaning of active signatures through evaluating genes in the signatures, pathways associated with the signatures, datasets related to the signatures, and a gene-gene network built upon the signatures compendium are set to zero or one according to whether they are below or above the range. After processing, the dataset is ready for ADAGE signature analysis.

Active signature detection
The concept of ADAGE signature was first introduced in [13]. To recap, in an ADAGE model, genes connect to nodes via weights and this vector of weights characterizes each node (Fig. 2). The distribution of the weight vector centers near zero and is close to normal, with a small number of genes contributing high weights. As opposed to a simple set of genes, the signature is a gene set with a weight value for each gene that indicates the gene's importance to the set. We group genes with weights more extreme than 2.5 standard deviations from the mean on the positive side and the negative side of the weight distribution separately. These two gene sets form the positive signature and negative signature of a node and are simply named as "NodeXXpos" and "NodeXXneg". We elected to split these signatures because we were uncomfortable making a strong assumption that the positive and negative signatures of the same node must always directly share transcriptional regulatory programs. For simplicity and because the positive and negative sides of the distribution are already separated, we ignore the signs of weight values.
After data preparation, we calculate each signature's activity for every sample in the input dataset. The signature's activity reflects how active that signature is in each sample and is defined as the average expression values of signature genes weighted by genes' absolute weights in the signature. This results in a matrix of activity values, where each row is a signature and each column is a sample, which can be shown in an activity heatmap (Fig. 1). To detect signatures associated with an experimental treatment, we apply statistical tests to signature activities. The most appropriate statistical test depends on the experimental design. After applying the selected test, we can identify a list of the signatures with the greatest changes in activity in response to a particular experimental treatment.
We have observed that multiple signatures can share many genes. Signature overlap may be due in part to the random noise added during training; however, we cannot rule out the possibility that the subtle differences between two overlapping signatures are biologically meaningful. Therefore, we must carefully handle signature overlap to remove redundant signatures but retain those relevant to a specific dataset. To accomplish this, we calculate marginal activities of every combination of two signatures. The marginal activity for a signature pair is the activity of one signature after removing genes that it overlaps with the second signature. After removal, we test whether the marginal activities still respond significantly to the treatment. We ignore signatures when their activity is no longer significant after removing the effect of another signature, as long as the other signature is not also removed through this process. In a special case where a group of signatures all become non-significant after removing each other, we keep the one that is most significantly altered. This process results in a final list of signatures affected by an experimental treatment.

Signature interpretation
It is important to note that a benefit of ADAGE signature analysis, as opposed to attempting to interpret the entire set of signatures, is that investigators only need to examine signatures that are affected by their experiment. Signatures are gene sets formed based on the expression patterns in a gene expression compendium. They are not annotated to a specific biological process, but we have used several strategies to help understand the biological meaning behind a signature. The most intuitive way is to examine its gene composition. If some genes have been previously characterized and share a biological theme, they may suggest the biological process represented by the signature. Users can also link existing In an ADAGE model, every gene is linked to every node through an edge. The edge weight is fixed after model training and its magnitude is reflected by edge thickness. The distribution of gene weights to a node is centered at zero and close to normal. Genes giving weights higher than the positive high-weight (HW) cutoff together form the positive gene signature for that node (genes in the orange circle). Similarly, genes giving weights lower than the negative HW cutoff together form negative gene signature for that node (genes in the blue circle) curated biological knowledge such as KEGG [17] pathways and GO terms [18] to signatures through an enrichment test, though such annotations are not always available. Even when they exist, annotations are not expected to be comprehensive for non-traditional model organisms. Finally, users can probe a signature by analyzing the compendium and extracting experiments in which the signature has its largest activity ranges. We expect these to be experimental conditions in which the biological processes represented by the signature are perturbed. Taken together or individually, these steps allow researchers to quickly interpret signatures of interest without curating a complete collection of pathways.
To visualize which genes are in signatures and how they are related across the overall model, we construct a gene-gene network using genes in selected signatures of interest. The network is built upon gene-gene relationships extracted from an ADAGE model. In the network, two genes are linked by an edge if the correlation between their weight vectors, i.e. how strongly connected they are to each node, is higher than a tunable cutoff. Depending on how they are linked with each other, genes can form modules in the network. These modules highlight functional units of genes in differentially active signatures. The network can be interactively explored and its use is facilitated by overlaid information, such as gene descriptions, differential expression in the experiment being analyzed, and annotations for each gene from GO and KEGG where available.

User interface
There are two ways for users to access ADAGE signature analysis. We provide an R package, intended for computationally inclined users and a web server intended for those without familiarity with the R programming language. The R package and the web server are both preloaded with a Pseudomonas aeruginosa gene expression compendium containing microarray samples measured on the Pae_G1a Affymetrix Pseudomonas aeruginosa array that were available on the ArrayExpress database [19] before July 31 2015, a previously published eADAGE model built on this compendium [13], and P.a. gene information retrieved from NCBI's ftp site. Both are open source and licensed freely, so that investigators can add their own machine learning models and additional organisms. We also plan to expand both resources to include additional nontraditional model organisms.

Web server
We developed a web server that implements the most central components of ADAGE signature analysis. The web server is designed with a clean separation between a backend API and a JavaScript application frontend. This allows programmatic access to the server if desired. To briefly introduce the software stack, the backend is written in Python using the Django framework. The frontend is implemented in AngularJS with Vega and D3 used to provide interactive visualizations. Both the backend and frontend are available under the permissive 3clause BSD open source license. Advanced users can initialize their own instance of the ADAGE web server, load models of their choosing, and supply this interface to users. We document a deployment process, via Docker containers, that users can follow to initialize their own server along with sample input files. This process is detailed in the README file of the https:// github.com/greenelab/adage-server repository.
Our public instance of the ADAGE web server is hosted on Amazon Web Services. Here we describe the main features provided by the web server. Users first need to choose a machine learning model on the homepage and all the following analyses are model specific. Then users can explore assays and experiments and perform signature analysis (Analyze), explore genes' similarities in the model through a gene-gene network (GeneNetwork), explore signatures in the model (Signature), and obtain annotations for the underlying sample compendium (Download).
The Analyze feature guides users to perform the entire ADAGE signature analysis pipeline (Fig. 3). To begin the analysis, users first search experiments or samples with a keyword. Next users identify experiments or samples of interest and add them to the analysis basket. Clicking the analysis basket takes users to the Sample page, where sample information is listed. A heatmap is plotted showing signature activities in all samples and can be clustered by sample or signature. To compare two groups of samples, users assign samples to either a treatment (yellow) or a control (purple) group. Then a differential activation test between the two groups is performed and its result is presented in a volcano plot showing difference in mean activity in the x-axis and significance p-value in the y-axis. Signatures with positive activity differences are active in the comparison group. Users choose signatures that are highly differentially active and examine them either in the Signature page or the GeneNetwork page described below.
The GeneNetwork feature allows users to investigate a gene or a group of genes in an ADAGE-derived genegene network. Users input genes of interest from the organism associated with that machine learning model. A network including input genes and their connecting genes drawn from the machine learning model will be shown. The default view presents genes connected by an absolute edge correlation higher than 0.5. Users can adjust the correlation range to examine more strongly or weakly connected genes. Another way to obtain genes of interest is from a two-group comparison through the Fig. 3 ADAGE web server interface and analysis workflow. The ADAGE web server interface has six tabs on the top. The signature analysis pipeline starts with choosing a machine learning model on the homepage (step1) and choosing to explore assays and experiments (step2). In the Analyze page, users can search datasets or samples using keywords (step3). After clicking the "+" button beside each dataset or sample, the dataset with all samples in it or an individual sample is added into the analysis basket (step4). After users add samples, clicking the basket brings users to the Sample page (step5). The Sample page provides experimental information about each sample and a signature activity heatmap. Users can define a two-group comparison by selecting samples and assigning them to either treatment (yellow) group or control (purple) group (step6). The signatures that are differentially active between two groups of samples can be examined in a volcano plot (step7). Next users select signatures in the volcano plot (step8) and further inspect them in the Signature page (step9). The Signature page provides information about gene composition, gene set association, and related datasets of a signature. Lastly, users can visualize their interested signatures in a gene-gene network (step10). Users can also directly examine genes in the gene-gene network through the GeneNetwork tab and examine a signature through the Signature tab Analysis feature as described above. If the gene-gene network results from this process, then the color of each gene in the network reflects gene expression fold change. Otherwise, the node color is grey. Gene information such as names and function description are provided when clicking on a node. Signatures shared by two connecting genes are shown when clicking on an edge.
The Signature feature helps users to interpret the biological meaning of an ADAGE signature. After choosing a machine learning model, users can directly examine a signature in the model. The Signature page lists each gene in a signature and its known functions. It also presents a table of GO and KEGG pathways associated with the signature. To help identify other public datasets that have the signature active, it also shows experiments with the highest activity range for that signature.
The Download feature allows users to download sample annotations. These annotations are manually curated experimental information for each sample in the training compendium [13].

R package
We built an R package called ADAGEpath to perform ADAGE signature analysis. It is written exclusively in R using the devtools package [20] and is available on github (https://github.com/greenelab/ADAGEpath) under the BSD-3-clause license. The R package provides functions for loading in public or user-defined datasets either from a local machine or directly from ArrayExpress via an accession number. The recommended input format for Affymetrix microarray data is a set of raw CEL files, which can be directly processed from the probe level with the help of the affy [21], affyio [22], and preprocessCore [23] packages from Bioconductor. For other data types, information can also be loaded via the load_dataset() function. We provide a vignette demonstrating how RNA-seq data can be loaded into the package (https://github.com/greenelab/ADAGEpath/blob/master/vignettes/ RNAseq-example.Rmd). This vignette implements data loading and transformation with the TDM package [24] followed by linear transformation to the 0 to 1 range.
We also provide functionality to calculate each signature's activity for each sample in the dataset. After this step, limma [25] is applied to test differential activation. For the most frequently used two-group comparison we provide a build_limma() wrapper function. This provides helper functions for visualization such as plot_volcano() and another helper, get_active_signatures(), to retrieve signatures based on three potential criteria: filtering by significance, sorting by absolute activity difference, and optimizing significance and activity difference simultaneously through choosing signatures from the top Pareto fronts. In multi-objective optimization, if there is no other solution that outperforms a specific solution over all objectives, that solution is said to be Pareto optimal. All such solutions make up the Pareto front. We also provide helper functions to probe, visualize, and potentially remove redundant signatures for two-group comparisons. These functions (calculate_marginal_activity, plot_marginal_activation, plot_signature_overlap, and remove_redundant_signatures) can be applied in series to identify those signatures most affected by a treatment.
Limma also supports many types of experimental designs, such as factorial design and time-course experiments. We provide examples of analyzing a time-course experiment and a factorial-design experiment using limma in the package vignettes. Users can also apply other statistical tests to identify activated signatures if desired.
We also provide helper functions for signature interpretation. These include annotate_genes_in_signatures(), which will return the genes that comprise a signature, and annotate_signatures_with_genesets(), which will download GO terms and KEGG pathways [17] from the TRIBE web server [26] and use them to annotate signatures. The TRIBE web server also allows users to build and share their own custom gene sets, so the connection to this resource enables custom gene set analysis as well. Users can also visualize signatures via gene-gene networks using the function visualize_gene_network(). The network is built with the help of the R package igraph [27] and made interactive with the package visNetwork [28].

A comparison between two user interfaces
The web server and R package target users with different backgrounds and needs. For bench scientists, the web server is straightforward to use and does not require familiarity with any programming language. For bioinformaticians, the R package provides more flexibility. It allows programmatic access and integration with other analysis pipelines. Powered by rich statistical resources in R, the R package tests the differential activation using the more robust linear model (provided by the limma R package). The web server currently supports the twosample t-test for differential activation. Therefore, it's normal to get slightly different results from the two platforms. To avoid security and privacy issues with data storage and management, the web server currently supports only public datasets. The overlapping signature removal function is not available in the web server at this time. Users who wish to automatically filter overlapping signatures or analyze datasets that are not publicly available should use the R package.

Results
Here we demonstrate ADAGE signature analysis on an example dataset (GSE67006). This dataset contains wildtype and Δanr mutant grown as biofilms on the Cystic Fibrosis genotype bronchial epithelial cells (CFBE) in order to model cystic fibrosis airways infections. Anr is a transcriptional regulator responsible for the aerobic to anaerobic transition [42]. We performed ADAGE signature analysis to identify biological processes that were affected by Anr on CFBE cells. The script that reproduces the following analysis is available on Github (https:// github.com/greenelab/SignatureAnalysis-CaseStudy).
We first ran a two-group limma test between wildtype samples and mutant samples to detect signatures with significantly different activities. We visualized test results as a volcano plot with activity difference on the x-axis and test significance on the y-axis (Fig. 4a). Many signatures passed the 0.05 significance cutoff. To focus on the most differentially active signatures, we considered both activity differences and statistical significance by selecting signatures that were on the top 10 layers of Pareto fronts. This resulted in 36 signatures and their activities in each sample were visualized with a heatmap in which yellow indicated high activity and blue indicated low activity (Fig. 4b).
Because signatures could overlap in their gene compositions, we evaluated the effects of ADAGE parameters on overlap. We observed that signatures tend to have more overlapping genes when the model is trained with  Figure S1A, B). For each overlap, we cannot rule out the possibility that subtle differences are biologically meaningful. Supporting the potential biological distinctions between similar signatures, models trained with more noise also cover more biological pathways within a reasonable corruption range (0% to 25% corruption) (Additional file 1: Figure S1C). For this reason, we employed a data-driven process to remove overlapping signatures before performing detailed investigation of specific signatures.
We calculated the marginal activity for each combination of signature pairs and tested its significance in differentiating the wild-type and deletion strains. Figure 4c shows the test significance (adjusted p-values in the -log10 scale) when the signature in each column was removed from the signature in the row and the diagonal shows the significance of each signature. A cross sign indicates non-significant p-values. We define a signature to be redundant if it becomes non-significant after removing the effect of another signature. Following these rules, we dropped the following signatures: Node119pos, Node214neg, Node299pos, Node130pos, Node250neg, Node154pos, Node63neg, Node39neg, Node228pos, Node158neg, Node140pos, Node269neg, Node31neg, Node185neg, Node275pos, Node278pos, and Node285neg. Interestingly, Node34pos and Node28neg shared many genes (Fig. 4d), but they each contained additional genes and both remained significant in the marginal activation test. It is important to note that this approach considers signature overlap in the context of a specific dataset: in a different dataset, different overlapping signatures might be removed if they were more significantly associated in that context. At this stage, there were 19 differentially active signatures remaining. These were visualized together in a gene-gene network (edge correlation cutoff = 0.5) (Fig. 5a); the edges between genes in this network revealed sets of genes that have transcriptional relationships as detected in the ADAGE model. Network modules, discrete clusters within the network, can reveal regulons. For example, the sets of genes involved in denitrification (Fig. 5c).
We next performed analyses designed to suggest the biological basis of the activated signatures. We linked signatures to KEGG pathways through enrichment tests. Twelve of the nineteen signatures were significantly enriched for one or more of 14 KEGG pathways (Table 1, pathway association significance). We then tested whether these KEGG pathways were differentially active between wild-type and Δanr mutants using only genes shared by signatures and their associated pathways. Genes in seven of the 14 KEGG pathways were significantly activated (adjusted p-value <=0.05) ( Table 1, pathway activation significance). As a comparison, we also performed the popular gene set enrichment analysis (GSEA) [9] and considered pathways with FDR q-values lower than 0.05 in GSEA's permutation test (Table 2). Five pathways were detected by both GSEA and ADAGE: Type VI secretion system; Cytochrome c oxidase, cbb3-type; Nitrogen metabolism; Denitrification, nitrate = > nitrogen; Biosynthesis of siderophore group nonribosomal peptides. These pathways have been shown to be regulated by Anr [43]. Eight pathways were only detected by GSEA (Table 2, pathways with white background). Many are large pathways including the Ribosome and bacterial secretion system pathways (size of 56 and 90 respectively). GSEA has been found to be biased towards large gene sets [44]. Three of the GSEA-only pathways were associated with signatures that nearly met the signature selection criteria (Additional file 2: Figure S2), and two were not associated with any ADAGE signatures. Nine KEGG pathways were not significantly enriched in GSEA but were associated with active ADAGE signatures (Table 1, pathways with white background). Among them, two pathways (Cyanoamino acid metabolism and Iron complex transport system) were considered activated by ADAGE signature analysis and also achieved high enrichment scores in GSEA ( Table 2). The other seven pathways did not appear activated in this dataset. This primarily occurred when one signature was associated with several KEGG pathways, though sometimes only one or two of the pathways were strongly activated in this dataset. The fact that several pathways are grouped into one signature indicates that their expression frequently co-varies across the compendium. Therefore, pathways that are associated with active signatures but are not active themselves in a dataset should be viewed critically. They could be co-regulated pathways whose activities did not peak at the time of experiment or pathways that are co-regulated but only under conditions not relevant to this experiment.
ADAGE built not only signatures resembling existing pathways but also novel signatures that are unavailable in traditional pathway analysis. Seven differentially active signatures were not enriched for KEGG pathways, including Node35pos (Fig. 5b), the most active signature in the Δanr mutant (Fig. 4a). We also attempted to interpret Node35pos using GO terms but, as with our KEGG analysis, found it associated with no existing GO terms. Node35pos contains genes mexEF and oprN, which encode multidrug efflux protein, and many uncharacterized genes. The majority of the genes in Node35pos were highly expressed in Δanr mutants. To examine whether or not Node35pos captured a regulatory module, we analyzed the three most highly differentially expressed uncharacterized genes (PA4881, PA3229, and PA2486) in the STRING network [45] and found that they all returned similar networks that were subsets of Node35pos (Additional file 3: Figure S3). Interestingly, these STRING networks were primarily built upon text mining. The reference papers used by STRING that performed a transcriptional profiling either did not release data [46] or used RNA-seq [47] and thus they were not included in the eADAGE model training set. Therefore, we concluded that signature Node35pos captured a transcriptional process, which was independently supported by the STRING network. MexEF-OprN is known to be regulated by MexT [48], which is connected to genes in Node35pos but is not in Node35pos itself (Fig. 5b). Further examination showed that genes in Node35pos overlapped the MexT regulon (FDR q-value of 8.7e-23, MexT regulon obtained from CollecTF database [49]). The link between MexT and Anr has not been explicitly studied before. In strains lacking anr, the expression of mexT and MexT-regulated genes was higher. Because it has been shown that the MexT regulon, including the mexEF-oprN operon, is induced in response to nitrosative stress [50], we predict that the lack of the Anr-regulated denitrification genes nar, nir, and nor reduced detoxification of endogenously generated reactive nitrogen species [51][52][53], thereby activating MexT. The strong up-regulation of MexT, a redox-responsive regulator, may also be more active in Δanr mutant due to other changes in intracellular redox [54].
Through examining the overlapping genes in the seven uncharacterized signatures, we divided them into two groups (Additional file 4: Figure S4). Group 1 contains MexT regulatory programs as represented by Node35pos. Group 2 contains many quorum sensing controlled genes, which are lower in the Δanr mutant. Interestingly, the visualization of these pathways in output generated by this tool prompted the examination of connections between the MexT regulon and quorum  Table 1 Signatures selected by ADAGE signature analysis for the example dataset and their associated KEGG pathways The significance of a pathway's association with a signature and the significance of the pathway's activity difference between Δanr mutant and wild-type are provided respectively. Adjusted p-values were rounded off to three decimal digits. A star indicates a significantly active pathway with adjusted p-value < = 0.05. Pathways detected both by ADAGE signature analysis and GSEA (Table 2) are colored correspondingly sensing. Indeed, high expression of mexEF-oprN is associated with decreased quorum sensing due to the efflux of the QS molecule HHQ [55]. At the time we retrieved KEGG pathways and performed this analysis, signatures in Group 2 were still uncharacterized in KEGG. An updated analysis showed that signatures in Group 2 were now associated with KEGG pathways quorum sensing and phenazine biosynthesis, which were added to KEGG on 8-1-2016 and 3-27-2017 respectively. Signatures in Group 1 were still uncharacterized in the updated analysis. Quorum sensing and phenazine biosynthesis have been studied for a long time in P. aeruginosa with many well-characterized genes. The time lag in their annotation hinders their usage in traditional pathway analysis, yet ADAGE identified them directly from public data and grouped them into signatures. This again highlights the strength of ADAGE-based signature analysis: it does not rely on pre-defined pathways but uses regulatory patterns directly extracted from large compendia of gene expression data.

Discussion
Researchers performing ADAGE signature analysis reverse the steps of traditional gene set analysis: they first identify signatures with statistically significant differential expression patterns before attempting interpretation. These researchers then only need to focus on signatures relevant to their experiments. This is important for organisms with incomplete or absent gene sets, because Table 2 GSEA result for the example dataset Adjusted p-values were rounded off to three decimal digits. Pathways significantly enriched (FDR q-val < = 0.05) are highlighted by stars. Pathways detected both by ADAGE signature analysis (Table 1) and GSEA are colored correspondingly the alternative would be to curate every possible gene set before performing any analysis. The lag from discovery to gene set annotation also hinders the application of traditional gene set analysis to organisms with available curated resources. Our analysis revealed that even well characterized process, e.g. quorum sensing, can lack annotation in some resources and thus would not be detectable in traditional gene set analysis. However, biologists working in a field can often readily identify and interpret these differentially active ADAGE signatures. ADAGE signatures may also be comprised entirely of uncharacterized genes. Though such signatures would be difficult to interpret, they may represent novel biological processes. Thus ADAGE signature analysis is well suited to hypothesis generation in organisms about which little is known.
Constructing high-quality signatures in an unsupervised manner requires two key components: sufficient data and suitable feature extraction algorithms. The ideal data compendium should be a broad survey of an organism probed under many conditions. Signature analysis is unlikely to detect pathways that have never been perturbed in a compendium, and a heavily biased compendium would result in limited detection of biological processes. Though it is difficult to directly measure data comprehensiveness, both data quantity and a broad set of contributing research groups are expected to positively correlate with comprehensiveness. Quantity is important because more overall conditions are likely to have been measured, and the number of contributing research groups is important because they are likely to be studying different aspects of an organism's biology. As genome-wide measurements continue to grow, we expect such methods to be more broadly applied to reveal perturbed biological processes and pathways.
Good feature extraction algorithms are also needed to best utilize the available data. Many feature extraction approaches have been applied on biological data, such as PCA [56][57][58], ICA [59][60][61], and NMF [62][63][64]. We previously developed ADAGE, a neural network-based approach, and found it to outperform PCA and ICA in representing biological states [10] and capturing KEGG pathways [13] in P. aeruginosa. However, it is still challenging to comprehensively evaluate the "effectiveness" of features built by different approaches, especially for less-studied organisms. Because every method has its own underlying assumptions and objectives, we expect them to learn different types of features and complement each other. The concept of gene set analysis with datadefined gene sets is not limited to ADAGE signatures. Future work will focus on expanding this analysis pipeline to more feature types and providing support for more organisms.

Conclusions
Gene set analysis has been a powerful tool for interpreting the results of high throughput experiments. However, curating biologically important gene sets requires tremendous effort. As a result, many organisms, including those that are widely used to study certain processes, remain unannotated. In contrast with the sparsity of gene set annotations, high-throughput experimental data has accumulated rapidly. Unsupervised analysis of these data can reveal gene sets that parallel human-curated biological pathways. We introduced ADAGE signature analysis, a gene set analysis method powered by signatures built directly from expression data by ADAGE or other unsupervised feature construction methods. The approach includes three major steps: data preparation, differentially active signature detection, and signature interpretation. We compared this approach with GSEA on an example dataset and observed that ADAGE signature analysis and GSEA detected similar curated KEGG pathways. However, ADAGE signature analysis also identified a novel regulatory relationship unannotated in KEGG. This result highlights the advantage of ADAGE signature analysis: it does not depend on curated knowledgebases but instead the breadth of existing public data. ADAGE signature analysis is implemented in an R package and a web server for users with different backgrounds and needs. For those without a specific dataset to analyze, we also provided a gene-gene network view to explore transcriptional regulatory modules learned by ADAGE.

Additional files
Additional file 1: Figure S1. The relationship between corruption level used in building ADAGE models and the redundancy of signatures derived from the models. The plot summarized results from 100 ADAGE models built at each corruption level. A: The number of signature pairs with gene compositions significantly overlapped increases with corruption level until the corruption level reaches 30%. B: As corruption level increases, the number of signatures in a model that enriched of the same KEGG pathway also increases on average, indicating the signatures become more redundant. C: ADAGE models tend to capture more unique KEGG pathways (pathway coverage) when more noise was added during training until the corruption level is higher than 25%. (TIFF 903 kb) Additional file 2: Figure S2. KEGG pathways enriched in the GSEA analysis but not associated with selected signatures in the ADAGE signature analysis. In the same volcano plot as Fig. 4a, signatures associated with GSEA-only pathways are highlighted in blue while signatures lie on the first 10 Pareto fronts are highlighted in Red. Node137pos, Node252neg, and Node113neg obtained high significances in the activation test and would be considered if we lose the activation cutoff. Pathways Tryptophan biosynthesis, chorismate = > tryptophan (KEGG-Module-M00023) and Aminoacyl-trna biosynthesis (KEGG-Pathway-pae00970) are not associated with any signature, so they were not labeled in the plot. (TIFF 746 kb) Additional file 3: Figure S3. Validation of Node35pos as a transcriptional program via the STRING network. A: The largest connected module of the gene-gene network subset by genes in Node35pos. B: Gene-gene networks returned by STRING when searching PA2486, PA3229, and PA4881