FunSet: an open-source software and web server for performing and displaying Gene Ontology enrichment analysis

Background: Gene Ontology enrichment analysis provides an effective way to extract meaningful information from complex biological datasets. By identifying terms that are significantly overrepresented in a gene set, researchers can uncover biological features shared by genes. In addition to extracting enriched terms, it is also important to visualize the results in a way that is conducive to biological interpretation.
Results: Here we present FunSet, a new web server to perform and visualize enrichment analysis. The web server identifies Gene Ontology terms that are statistically overrepresented in a target set with respect to a background set. The enriched terms are displayed in a 2D plot that captures the semantic similarity between terms, with the option to cluster terms via spectral clustering and identify a representative term for each cluster. FunSet can be used interactively or programmatically, and allows users to download the enrichment results in tabular form, in graphical form as SVG files, or in data format as JSON or CSV. To enhance reproducibility of the analyses, users have access to historical data for the ontology and the annotations. The source code for the standalone program and the web server is made available with an open-source license.


Background
Gene Ontology (GO) [1] enrichment analysis represents an effective way to tame the complexity of biological datasets and to facilitate their interpretation. The underlying idea is to identify sets of GO terms that are statistically overrepresented in a gene set of interest (e.g., a set of differentially expressed genes in an RNA-seq experiment or a set of genes associated with a trait in a genome-wide association study).
In order for GO enrichment analysis to be of value to biologists and biomedical researchers, it is important to have access to tools that allow users to perform the analysis and effectively display and interact with the results. Reproducibility of the results is another critical requirement in GO enrichment analysis, as it has been shown that the GO controlled vocabulary is significantly changing over time, in ways that affect the results of the analyses [2].
*Correspondence: dghersi@unomaha.edu
1 School of Interdisciplinary Informatics, College of Information Science & Technology, University of Nebraska at Omaha, 1110 S 67TH, 68182 Omaha, NE, USA
Here we present FunSet, a new web server for performing GO enrichment analysis on gene sets and interactively displaying the results. The tool allows users to optionally cluster the results using a spectral clustering algorithm and to extract representative terms for each cluster. In addition to these features, FunSet enables users to choose previous versions of the GO vocabulary and corresponding annotations. The goal of this "time machine" feature is to foster reproducibility of GO analyses, which, as mentioned above, have been shown to be sensitive to the version of the ontology and annotations used [2]. A comparison of FunSet with existing GO enrichment analysis tools is shown in Table 1.
FunSet can be used programmatically with an API or from the command line. The source code for the entire pipeline (including the web server) is made available with an open-source license. In summary, the contributions of FunSet are: (1) a "time machine" feature that allows users to use GO historical data for reproducibility; (2) interactive visualization with clustering of terms and automatic identification of an optimal number of clusters and representative terms; (3) availability of the source code for both the command line programs and the web interface, enabling users to extend the pipeline or incorporate it into other existing pipelines.
A description of the implementation follows.

Gene sets
Enrichment analysis requires a target set (i.e., genes with a property of interest) and a background set. The user is required to enter the target set either as a comma-separated list in a text box or by uploading a text file. Optionally, the user can also upload a background gene set. Otherwise, by default FunSet will select as background all annotated genes for the chosen organism. The accepted format for specifying genes is HGNC symbols [9] for human, VGNC symbols [9] for cow and dog, and MOD (model organism databases) symbols [10] for model organisms.

FDR threshold
FunSet handles multiple comparisons using the Benjamini-Hochberg procedure [11]. The user has the option to enter a specific False Discovery Rate (FDR) threshold to filter the results; otherwise, FunSet uses the default threshold of 0.05.
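As an illustration of the multiple-testing correction described above, the following is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. It is not FunSet's own code: the function names, the dictionary input, and the choice to keep terms whose adjusted p-value is less than or equal to the threshold are assumptions made for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_terms(p_by_term, fdr_threshold=0.05):
    """Keep terms whose adjusted p-value is at or below the threshold.

    The "<=" comparison is an assumption of this sketch.
    """
    terms = list(p_by_term)
    q_values = benjamini_hochberg([p_by_term[t] for t in terms])
    return {t: q for t, q in zip(terms, q_values) if q <= fdr_threshold}
```

Calling significant_terms with a mapping from term identifiers to raw p-values returns only the terms that pass the chosen FDR threshold, together with their adjusted p-values.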

Ontology version
In order to facilitate the reproducibility of published results, FunSet allows the user to select historical versions of the GO controlled vocabulary and organism annotations.

Enrichment analysis
The per-term enrichment analysis is performed using the hypergeometric distribution, which models sampling without replacement:
$$P(X \ge k) = \sum_{i=k}^{\min(n, K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$$
where $P(X \ge k)$ is the probability of observing at least $k$ genes with a given GO term, $N$ is the total number of genes in the background set, $K$ is the total number of genes annotated with the given term, $n$ is the total number of genes in the target set, and $k$ is the total number of genes in the target set annotated with the given term.
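The upper-tail hypergeometric probability described above can be computed directly with exact integer arithmetic. The sketch below uses only the Python standard library; the function name is hypothetical, and k denotes the number of target genes annotated with the term.

```python
from math import comb


def hypergeom_pvalue(k, N, K, n):
    """P(X >= k) for a hypergeometric random variable X.

    N: genes in the background set
    K: background genes annotated with the term
    n: genes in the target set
    k: target genes annotated with the term
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probability mass from k up to the largest possible overlap.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

For example, with a background of 10 genes, 4 of which carry the term, and a target set of 3 genes, hypergeom_pvalue(2, 10, 4, 3) evaluates (36 + 4) / 120, i.e. one third.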

Clustering of terms
FunSet can also perform clustering of significantly enriched terms, in order to identify semantically similar groups of terms. The first step involves computing the semantic similarity between all pairs of enriched terms using the Aggregate Information Content (AIC) [12], an index that takes into consideration the information content of all ancestral terms of a GO term in the graph. The AIC index has been shown to perform better than other widely used measures of semantic similarity [12].
In the command line version of the program the user can also choose to use the Lin Index [13]. The Lin Index ranges from 0 (semantically unrelated terms) to 1 (semantically identical terms), and is computed as follows:
$$\mathrm{sim}_{\mathrm{Lin}}(t_1, t_2) = \frac{2\,\mathrm{IC}(c)}{\mathrm{IC}(t_1) + \mathrm{IC}(t_2)}$$
where $c \in S$ and $S$ is the set of Lowest Common Ancestors (LCAs) of the two terms with the maximum Information Content (IC). The IC of a term $t_i$ is calculated as:
$$\mathrm{IC}(t_i) = -\log p_{t_i}$$
where $p_{t_i}$ is the probability of the term $t_i$, calculated as the number of genes annotated with $t_i$ or with an ancestor term of $t_i$ divided by the total number of annotated genes. A matrix containing the pairwise semantic similarity between all enriched terms is then created and used to cluster the terms with the spectral clustering algorithm implemented in the scikit-learn [14] Python package, using default parameters and the desired number of clusters provided by the user. If the user does not specify a desired number of clusters, FunSet will estimate an optimal number using the eigengap strategy proposed by von Luxburg [15].
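To make the Lin Index concrete, here is a minimal sketch that computes it from term probabilities. It assumes that the probabilities of the two terms and of their common ancestors have already been derived from the annotations; the function names are hypothetical, and returning 0 when both terms carry no information is a convention of this sketch rather than something stated in the text.

```python
from math import log


def information_content(p_term):
    """IC(t) = -log p(t), where p(t) is the annotation probability of t."""
    return -log(p_term)


def lin_similarity(p_t1, p_t2, p_common_ancestors):
    """Lin index computed from term probabilities.

    p_common_ancestors holds p(c) for each common ancestor c of the two
    terms; the most informative ancestor (largest IC) is used.
    """
    ic_sum = information_content(p_t1) + information_content(p_t2)
    if ic_sum == 0:
        # Both terms have p = 1 and carry no information (sketch convention).
        return 0.0
    ic_ancestor = max(information_content(p) for p in p_common_ancestors)
    return 2 * ic_ancestor / ic_sum
```

Because the index is a ratio of information contents, the base of the logarithm does not affect the result.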
Finally, FunSet selects the medoid of each cluster, i.e., the term with the largest average semantic similarity with respect to all terms in the cluster, as the cluster representative.
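The medoid selection described above can be sketched as follows, assuming a precomputed pairwise similarity matrix and one cluster label per term. The function name and the list-of-lists input format are assumptions of this example, not FunSet's actual interface.

```python
def cluster_medoids(similarity, labels):
    """Return {cluster label: index of the medoid term}.

    similarity: square list of lists, similarity[i][j] between terms i and j
    labels: cluster label of each term, in the same order as the matrix rows
    """
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)

    medoids = {}
    for label, indices in members.items():
        # Average similarity of a member to all terms in its own cluster.
        def average_similarity(i):
            return sum(similarity[i][j] for j in indices) / len(indices)

        medoids[label] = max(indices, key=average_similarity)
    return medoids
```

The term with the highest average similarity to its cluster is returned for each label; ties are resolved in favor of the term that appears first.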

JavaScript Object Notation Application Programming Interface (JSONAPI)
FunSet is, at its core, a RESTful web service that meets the JSONAPI standard [16]. JSONAPI is a prescriptive format and protocol that sits on top of HTTP and promotes well-defined multi-platform interoperability by eliminating the need for ad-hoc code to be defined on a per-application basis. FunSet uses JSONAPI as a means to execute an analysis pipeline, translate the analysis data into a web-serialized format, and pipe it to a frontend web visualization interface, described in a later section. In addition, the FunSet web service also exposes its underlying capabilities publicly, allowing users to programmatically invoke the enrichment and clustering process and receive results as raw JSON.

API endpoints
The FunSet API is organized around a set of API endpoints that can be invoked programmatically using a REST client, such as Postman [17], using any HTTP command line tool, such as cURL [18], or via the web using the visualization client application. The endpoints it provides are documented below. Each endpoint accepts HTTP GET and/or POST requests. Endpoint documentation below uses the following notational syntax: where the parenthetical, (), denotes a pattern that occurs 0-1 times, the wildcard parenthetical notation, ()*, indicates the pattern occurs 0 or more times, httpmethod is either GET or POST, path is a relative URL from root (e.g. funset.uno/path) that identifies the corresponding API endpoint, <id> is the unique id of the object (where applicable), an optional_parameter is a URL-encoded parameter the endpoint optionally accepts, and (parameter_key : value_type)* is a list of required parameters (e.g. POST parameters) that, where applicable, are encoded following the encoding_type.
The runs/invoke method is the primary endpoint of the API and creates a new run object, following the schema defined below, in JSON format. Broadly speaking, a run is an object that encapsulates the results of one execution of the enrichment and spectral clustering algorithms. A run contains the results of the execution as a set of enriched terms, each of which is represented as an enrichment object, following the schema below.
The runs/invoke endpoint will produce well-defined output enrichments when the POST parameters satisfy all of the following conditions:
• all gene strings in the genes list are valid GO gene ids;
• all gene strings in the background list are valid GO gene ids;
• the p-value, representing the false discovery rate to use for the run, is a float between 0 and 1;
• the clusters parameter is an integer from 1 to the total number of target genes supplied, representing the desired number of clusters to use in the spectral clustering algorithm, or -1 for automatic detection of the optimal number; and
• the organism parameter is one of the following 3-letter codes: 'hsa', 'gga', 'bta', 'cfa', 'mmu', 'rno', 'cel', 'ath', 'dme', 'sce', 'eco', or 'dre'.
To retrieve the data for each of the enriched terms, one should make an additional request to the GET /enrichments endpoint defined below, for each enriched term id listed in the run.enrichments field.
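The conditions on the runs/invoke parameters listed above can be checked on the client side before a request is sent. The sketch below is a hypothetical helper, not part of FunSet: the function name and argument names are assumptions, and it does not verify that gene identifiers are valid, since that requires the annotation data held by the server.

```python
# Organism codes accepted by runs/invoke, as listed in the text above.
ORGANISMS = {"hsa", "gga", "bta", "cfa", "mmu", "rno",
             "cel", "ath", "dme", "sce", "eco", "dre"}


def check_run_parameters(genes, p_value, clusters, organism):
    """Return a list of problems found in the runs/invoke parameters."""
    problems = []
    if not genes:
        problems.append("the target gene list is empty")
    if not 0 <= p_value <= 1:
        problems.append("p-value must be a float between 0 and 1")
    # -1 requests automatic detection of the number of clusters.
    if clusters != -1 and not 1 <= clusters <= len(genes):
        problems.append("clusters must be -1 or between 1 and the number of target genes")
    if organism not in ORGANISMS:
        problems.append("unknown organism code: " + organism)
    return problems
```

An empty list means that the parameters satisfy the documented ranges; any returned message points to the parameter that should be corrected before invoking the endpoint.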
GET /runs/<id> returns a previously completed run object specified by the <id>, or a 404 Not Found error if the <id> does not point to a valid run object.
GET /enrichments/<id>?include=term,term.parents,genes returns an enrichment term's data, whose primary key is <id>, corresponding to the enrichment schema below, or a 404 Not Found error if the term specified by the <id> does not exist. If passed the include parameter with term, term.parents, and/or genes, the method will also fetch and return all related term and gene fields (see the term and gene schemas, respectively, below).
GET /runs/<id>/recluster?clusters=<num_clusters> re-runs the spectral clustering algorithm for an existing run specified by <id>, grouping terms into a number of clusters equal to num_clusters as specified by the URL-encoded parameter clusters, where num_clusters must be a number between 1 and the total number of terms in the background set. This method returns a run object with the same structure as /runs/invoke, or returns 404 Not Found if the run specified by the <id> is not an extant valid run.
GET /terms/<id> returns GO term data, following the term schema below, for the term matching the <id>, or a 404 Not Found error if the term does not exist.
GET /genes/<id> returns gene data, matching the gene schema below, for the gene specified by the <id>, or a 404 Not Found error if the id is invalid.
Table 2 shows FunSet's API data schema. The Gene object has the following fields:
• id (int)
• name (string)
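The GET endpoints above can be addressed with ordinary URL construction. The following sketch only builds the request URLs and does not contact the server. The helper names are hypothetical, and using the project home page http://funset.uno as the API root is an assumption based on the funset.uno/path example in the text.

```python
from urllib.parse import quote, urlencode

# Assumption: the API root coincides with the project home page.
BASE_URL = "http://funset.uno"


def run_url(run_id):
    """URL of GET /runs/<id>."""
    return f"{BASE_URL}/runs/{quote(str(run_id))}"


def enrichment_url(enrichment_id, include=("term", "term.parents", "genes")):
    """URL of GET /enrichments/<id> with the optional include parameter."""
    # Keep commas unescaped so the include list reads as in the text.
    query = urlencode({"include": ",".join(include)}, safe=",")
    return f"{BASE_URL}/enrichments/{quote(str(enrichment_id))}?{query}"


def recluster_url(run_id, num_clusters):
    """URL of GET /runs/<id>/recluster?clusters=<num_clusters>."""
    query = urlencode({"clusters": num_clusters})
    return f"{run_url(run_id)}/recluster?{query}"
```

The resulting strings can be passed to any HTTP client; a 404 Not Found response indicates that the requested object does not exist.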

Visualization Techniques
To visualize the results of the GO enrichment analysis, we built a client-side front-end as a web application using Ember.js [19,20] and D3.js [21]. The web application allows users to specify a target gene set, a background gene set, a p-value, and the ontology, namespace, and organism to be used for enrichment analysis. Given the user selections, the web application invokes the runs/invoke API described above, mapping the user selections in the interface to the input parameters as specified. The JSON results returned by the API are then rendered into an SVG visualization. The FunSet visualization represents terms in a 2D coordinate space, where terms are positioned using Multidimensional Scaling (MDS) on the distance matrix obtained from the pairwise AIC semantic similarity index described before. A term's (x, y) coordinate location in the SVG is characterized by the following formula.
where $\mathrm{svg}_w$ and $\mathrm{svg}_h$ are, respectively, the pixel width and height of the SVG as it fits in the user's browser, and $\mathrm{sc}_x$ and $\mathrm{sc}_y$ are, respectively, the spectral clustering x and y results, ranging from 0 to 1. In effect, this scales the SVG to the user's browser size while maintaining the original, location-significant aspect ratio. Node size in the visualization graph is scaled according to the enrichment size produced by the enrichment analysis. The enrichment size for a term is calculated as the number of genes associated with the term in the target set divided by the expected number of genes. After setting the initial term locations to the scaled clustering locations, FunSet's visualization interface applies a velocity Verlet integrator, using D3's force library [22], to each term to distribute terms away from one another, uniformly, within the SVG space. This technique mitigates scenarios where terms are tightly stacked within a cluster, which makes visual interpretation difficult. The Verlet numerical integrator used in FunSet simulates physical motion of terms in the SVG by applying a constant acceleration a over a time interval t to the term's velocity, changing its (x, y) position at each time step. With velocity initially set to 0, this accelerates terms in the graph by adding a to the term's velocity at each time step. To disperse terms without disrupting the underlying cluster structure, we apply a uniform repulsive force to each term that simulates magnetic repulsion. At the same time, a link force is applied between terms with parent/child relationships in the data. Finally, a decay function simulating physical friction stabilizes the graph and allows it to reach a steady state. The entire physics simulation is optimized to perform well even for large networks of enriched terms.
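The dispersal step described above can be illustrated with a deliberately simplified simulation written in Python. This is not D3's force library and not FunSet's front-end code: it is a sketch, under the assumption of a plain pairwise repulsion whose strength falls off with distance, combined with a velocity decay that plays the role of friction. Link forces are omitted, and all names and default values are hypothetical.

```python
def simulate(positions, steps=100, strength=1.0, decay=0.4):
    """Spread 2D points apart with pairwise repulsion and velocity decay."""
    pos = [list(p) for p in positions]  # work on a copy of the input
    vel = [[0.0, 0.0] for _ in pos]     # velocities start at zero
    for _ in range(steps):
        for i, (xi, yi) in enumerate(pos):
            for j, (xj, yj) in enumerate(pos):
                if i == j:
                    continue
                dx, dy = xi - xj, yi - yj
                dist_sq = dx * dx + dy * dy or 1e-6  # avoid division by zero
                # Repulsion pushes point i away from point j and weakens
                # with distance; exactly coincident points are not separated
                # by this simplified force.
                vel[i][0] += strength * dx / dist_sq
                vel[i][1] += strength * dy / dist_sq
        for p, v in zip(pos, vel):
            # Friction: damp the velocity, then move the point.
            v[0] *= 1 - decay
            v[1] *= 1 - decay
            p[0] += v[0]
            p[1] += v[1]
    return pos
```

Starting from two points one unit apart, a few steps of this loop move them away from each other along the line that joins them, which is the qualitative behavior that the repulsive force is meant to produce.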
When a user clicks a node to inspect it further, FunSet auto-expands the cluster and term panels and then jumps to the enriched term's reference material on the right-hand side. GO terms are linked to AmiGO [23] so that users can jump directly to the external GO term reference page.
Table 2 (continued)
• cluster (int): the cluster to which the enriched term is assigned
• parents (list of id (int)): defines a many-to-many relationship to Term
• medoid (boolean): true if this term is the medoid of its cluster
• genes (list of id (int)): defines a one-to-many relationship to Gene that represents all genes enriched in the sample
The leading item on each line is the data field name (i.e., the field in the schema).
Figure 1 shows an overview of the FunSet visualization user interface. The SVG space with the clusters of terms is shown on the left. This area is pannable and zoomable by left clicking and dragging or using the mouse wheel, respectively. The right-hand side of the interface shows information about the computed enrichment analysis: the time when the run was created (clicking it copies a permanent link that the user can use to return to this run data), the total number of terms within the ontology data used, the total number of terms that were found to be enriched, and the set of clusters the enriched terms fell into. The interface allows the user to change the number of desired clusters using the right-hand slider. The interface also allows the user to show or hide clusters by toggling the cluster visibility buttons. Clicking a cluster will expand it to show the enriched terms with their associated data, such as the false discovery rate (FDR) and the enrichment size (ES). The term panel allows users to click a particular term to highlight it, in red, in the SVG graph. Clicking a term in this panel will also display the term's description. A second panel (not shown in the figure) shows the specific genes contributing to the enrichment for each cluster.

Running the GO enrichment analysis and visualizing the results
The visualization UI also allows the user to export the results of a run as SVG, JSON, or CSV. Both the JSON and CSV data structures follow a hierarchical format consistent with the API description. In addition, the interface allows a user to export JSON data for a particular cluster and to click nodes in the graph to expand their term information within a cluster.

Case study: comparing enrichment analysis results across time
A study by Wadi et al. showed that outdated enrichment tools could only recover 26% of biological processes and pathways identified with more up-to-date resources [2]. As a proof-of-principle, we used FunSet to perform GO enrichment analysis in the "biological process" namespace on a list of predicted cancer "driver" genes [24] using 2013 and 2018 GO vocabulary and annotations, respectively. The results show a substantial difference in the number of enriched terms, with 364 gained terms with respect to the 2013 version, and 64 "lost" terms (Fig. 2).
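A comparison of this kind reduces to set operations on the identifiers of the enriched terms returned by two runs. The sketch below assumes that each run has already been reduced to a collection of GO term identifiers; the function name is hypothetical.

```python
def compare_runs(old_terms, new_terms):
    """Split enriched GO term ids into gained, lost, and shared sets.

    old_terms: term ids enriched with the older GO data
    new_terms: term ids enriched with the newer GO data
    """
    old_terms, new_terms = set(old_terms), set(new_terms)
    return {
        "gained": new_terms - old_terms,  # enriched only with the newer data
        "lost": old_terms - new_terms,    # enriched only with the older data
        "shared": old_terms & new_terms,  # enriched with both versions
    }
```

The sizes of the "gained" and "lost" sets correspond to the counts of gained and lost terms reported for the two GO versions.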
We used the same list of genes to highlight how clustering can help to summarize long lists of enriched terms. As shown in Fig. 3, FunSet automatically identified twelve clusters of terms, and returned the representative (medoid) term for each cluster. The representative terms are shown in Table 3.

Discussion
Enrichment analysis is a widely used bioinformatics approach that enables experimental and computational investigators to extract meaningful information from long lists of genes. Here we introduced FunSet, a new tool for performing and visualizing GO enrichment analysis interactively through a web server, programmatically via an API, or from the command line. We also discussed a case study that illustrates the impact of time (and therefore different versions of the GO vocabulary and annotations) on the results of otherwise identical enrichment analyses. This points to the importance of using time-stamped versions of the GO vocabulary and corresponding annotations when attempting to reproduce computational analyses. To the best of our knowledge, this is the first time that a comprehensive tool for GO enrichment analysis and visualization allows users to use historical GO data. The case study also illustrates the use of clustering to identify meaningful groups of terms that can be summarized with one representative term per cluster, automatically chosen by FunSet. We note that while FunSet can determine an optimal number of clusters with the eigengap procedure [15], users still have the option (and are encouraged) to explore different numbers of clusters, to identify groups of terms that match their biological intuition at the desired granularity level.

Conclusions
We have introduced a novel tool named FunSet to perform and visualize GO enrichment analysis. By having access to the fully documented source code of the pipeline, users can deploy FunSet on a private cloud for increased computational performance, and potentially customize it using other controlled vocabularies. Further, the availability of a documented, open-source standalone program allows users to incorporate FunSet into other bioinformatics pipelines or extend its features.

Availability and requirements
Project name: FunSet
Project home page: http://funset.uno
Operating system(s): Platform independent (web server); Linux, Mac OS X (command-line software)
Programming language: Python, C++, JavaScript
Fig. 3 Clustering of enriched terms. The list of predicted cancer driver genes in [24] yields 630 enriched GO terms in the biological process namespace using 2018 GO data. FunSet automatically identified 12 representative clusters using the eigengap approach [15]
Table 3 Representative terms (medoid terms) in the biological process namespace automatically identified by FunSet for the gene list reported in [24]