gsGator: an integrated web platform for cross-species gene set analysis
BMC Bioinformatics volume 15, Article number: 13 (2014)
Gene set analysis (GSA) is useful in deducing biological significance of gene lists using a priori defined gene sets such as gene ontology (GO) or pathways. Phenotypic annotation is sparse for human genes, but is far more abundant for other model organisms such as mouse, fly, and worm. Often, GSA needs to be done highly interactively by combining or modifying gene lists or inspecting gene-gene interactions in a molecular network.
We developed gsGator, a web-based platform for functional interpretation of gene sets with useful features such as cross-species GSA, simultaneous analysis of multiple gene sets, and a fully integrated network viewer for visualizing both GSA results and molecular networks. An extensive set of gene annotation information is amassed, including GO and pathways, genomic annotations, protein-protein interaction, transcription factor-target (TF-target), miRNA targeting, and phenotype information for various model organisms. By combining the functionalities of Set Creator, Set Operator and Network Navigator, users can perform highly flexible and interactive GSA by creating a new gene list from any combination of existing gene sets (intersection, union and difference) or by expanding genes interactively along molecular networks such as protein-protein interaction and TF-target. We also demonstrate the utility of the interactive and cross-species GSA implemented in gsGator through several usage examples for interpreting genome-wide association study (GWAS) results. gsGator is freely available at http://gsGator.ewha.ac.kr.
Interactive and cross-species GSA in gsGator greatly extends the scope and utility of GSA, leading to novel insights via conserved functional gene modules across different species.
High-throughput experiments such as microarrays, next-generation sequencing and mass spectrometry-based proteomics provide genome-scale molecular profile data in an unbiased manner. Gaining biological insight into the underlying mechanisms, however, requires interpretation of several hundred or even thousands of candidate genes. Gene set analysis (GSA) has been highly successful in interpreting the results of high-throughput experiments, and a number of GSA tools have been developed, including DAVID, GeneCodis3, WebGestalt2, g:Profiler, GARNET and ToppCluster, to name a few.
Still, functional genomics annotation is far from complete. In particular, phenotypic annotation is sparse for human genes but far more abundant for model organisms such as mouse, fly, and worm. However, cross-species GSA has not been extensively used to interpret human gene lists. Interpretation of omics experiments is usually done in a highly interactive manner. For example, the gene list from the same experiment (e.g. microarray) needs to be analyzed multiple times using different stringency criteria. Several gene sets from related experiments need to be combined for another GSA by set union, intersection or difference. Analysis of gene lists in the context of molecular networks may lead to novel mechanistic insights. For example, a gene list can be trimmed by taking only genes directly connected to each other or, alternatively, expanded to include network neighbors in different types of molecular network. Flexible combination of such interactive features is highly desirable for any web-based GSA tool, because it greatly increases the sensitivity and interpretability of the analysis.
While the utility of interactive and cross-species GSA is evident, few tools support such functionality in a single, unified environment (Table 1). Here, we developed a web-based tool, gsGator, with many useful features such as cross-species GSA and a network viewer. The whole analysis is virtually automated with a convenient drag-and-drop interface. A broad range of gene annotations is collected for seven common organisms: human, mouse, fly, worm, yeast, Arabidopsis and E. coli. Ample information on phenotypes and functions is available for these organisms, and the full list of annotation types and their statistics is available on-line.
Construction and content
gsGator consists of four main modules: set creator, set operator, set analyzer and network navigator. The following is a brief introduction to the main features of each module and its user interface. We chose the NCBI Entrez Gene ID as the unique gene identifier throughout the whole system. The pre-compiled gene sets are listed under the ‘public’ category and the user-input gene sets under the ‘private’ category. In gsGator, GSA is always performed between the private and the public annotations. A typical analytic procedure in gsGator is shown in Figure 1.
Set Creator lets the user define new gene sets de novo, where the input gene/protein IDs are automatically converted to Entrez Gene IDs (Additional file 1: Table S1). Gene lists can be entered directly in the input box or uploaded as a file. Set creator is equipped with an orthology mapping tool, with which any gene set can be converted to its orthologous sets for multiple model organisms in a single step. For each model organism, a separate gene set is created, e.g. ‘example_set_yeast’, ‘example_set_mouse’, etc. for a set named ‘example_set’. Currently, orthology mapping is done using information from the InParanoid database. All user-created gene sets are deposited in the private category.
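The one-step orthology mapping described above can be illustrated with a minimal Python sketch. It is not the gsGator implementation: the `ORTHOLOGS` table, the placeholder gene IDs and the function name `map_to_orthologs` are all hypothetical, and only the naming scheme (‘example_set_mouse’, ‘example_set_yeast’) follows the text.

```python
from typing import Dict, Iterable, Set

# Hypothetical ortholog table: human gene ID -> {species: ortholog IDs}.
# The IDs are made-up placeholders, not real InParanoid records.
ORTHOLOGS: Dict[str, Dict[str, Set[str]]] = {
    "h1": {"mouse": {"m1"}, "yeast": {"y1", "y2"}},
    "h2": {"mouse": {"m2"}},
    "h3": {},  # a gene without any known ortholog
}


def map_to_orthologs(
    set_name: str, genes: Iterable[str], species_list: Iterable[str]
) -> Dict[str, Set[str]]:
    """Return one derived gene set per species, named '<set_name>_<species>'."""
    genes = list(genes)
    derived: Dict[str, Set[str]] = {}
    for species in species_list:
        mapped: Set[str] = set()
        for gene in genes:
            # Genes or species missing from the table contribute nothing.
            mapped |= ORTHOLOGS.get(gene, {}).get(species, set())
        derived[f"{set_name}_{species}"] = mapped
    return derived


result = map_to_orthologs("example_set", ["h1", "h2", "h3"], ["mouse", "yeast"])
for name, members in sorted(result.items()):
    print(name, sorted(members))
```

Note that a derived set can be smaller or larger than the input set, because some genes have no ortholog in a given species while others have several.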
Set Analyzer lets the user perform gene set analysis (GSA) through a convenient drag-and-drop user interface. Simultaneous GSA of multiple gene sets is allowed. First, the user selects the target species, with human (Homo sapiens) set as the default. Once a species is selected, only the gene sets relevant to that species are listed in the category tree for both the public and private sections. The user can select target gene sets from both the public and private categories by drag-and-drop with the mouse. Because GSA is performed between the selected private and public categories, at least one private and one public category must be selected in the input area. The significance of GSA is calculated using the hypergeometric test and kappa statistics, with multiple-testing correction by the Benjamini-Hochberg method. The GSA result is presented in a table with its statistical significance (p-value and q-value) shown as a heatmap. The GSA result can also be visualized as a network for intuitive inspection.
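As an illustration of the statistics named above, the following Python sketch computes a one-sided hypergeometric p-value for the overlap between a query gene list and an annotation gene set, and then applies the Benjamini-Hochberg adjustment to a list of p-values. It is a textbook formulation under stated assumptions, not code taken from gsGator; the function names and the choice of gene universe are hypothetical, and the kappa statistic is not covered.

```python
from math import comb
from typing import List


def hypergeom_pvalue(overlap: int, query_size: int, set_size: int, universe_size: int) -> float:
    """P(X >= overlap) when query_size genes are drawn without replacement
    from a universe that contains set_size annotated genes."""
    upper = min(query_size, set_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Toy numbers: 3 of 5 query genes fall in a 40-gene annotation set,
# with an assumed universe of 20000 genes.
p = hypergeom_pvalue(3, 5, 40, 20000)
q = benjamini_hochberg([p, 0.2, 0.5])
print(p, q)
```

The size of the gene universe strongly affects the p-value, so in practice it should match the set of genes that could have appeared in both the query list and the annotation source.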
Set Operator is for creating a new gene set by the union, intersection or difference of two preexisting gene sets. The resulting gene list is directed to set creator for generating a new gene set.
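The three set operations map directly onto Python's built-in set type. The sketch below is only an illustration with placeholder gene IDs; the `combine` helper is hypothetical and not part of gsGator.

```python
from typing import Iterable, Set


def combine(set_a: Iterable[str], set_b: Iterable[str], operation: str) -> Set[str]:
    """Combine two gene sets by union, intersection or difference (A minus B)."""
    a, b = set(set_a), set(set_b)
    if operation == "union":
        return a | b
    if operation == "intersection":
        return a & b
    if operation == "difference":
        return a - b
    raise ValueError(f"unknown operation: {operation}")


# Placeholder gene IDs for two hypothetical hit lists.
study_1 = {"g1", "g2", "g3"}
study_2 = {"g3", "g4"}

print(sorted(combine(study_1, study_2, "union")))
print(sorted(combine(study_1, study_2, "intersection")))
print(sorted(combine(study_1, study_2, "difference")))
```

Difference is the only one of the three operations that depends on argument order, so the first set should be the one to subtract from.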
Network navigator allows the user to explore molecular networks such as protein-protein interaction (PPI), TF-target and miRNA-target relations. Starting with a particular gene set as seed, the user can expand genes along the molecular networks in an interactive fashion. Right-clicking a selected node opens a pop-up menu for choosing the type of network to expand along. Once the modification of the network is complete, the remaining nodes (genes) can be exported to set creator to generate a new gene set. Combined use of set operator and network navigator allows the user to create new gene sets from any preexisting gene sets in a highly flexible and interactive manner.
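Network expansion and its counterpart, trimming a gene list to directly connected genes, can be sketched on an undirected edge list as follows. This is a minimal illustration assuming a one-step (direct neighbor) expansion; the edge list, gene names and function names are hypothetical and do not reflect how gsGator stores or traverses its networks.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def expand_one_step(seeds: Iterable[str], edges: List[Edge]) -> Set[str]:
    """Return the seed genes plus all of their direct network neighbors."""
    seed_set = set(seeds)
    expanded = set(seed_set)
    for u, v in edges:
        if u in seed_set:
            expanded.add(v)
        if v in seed_set:
            expanded.add(u)
    return expanded


def trim_to_connected(genes: Iterable[str], edges: List[Edge]) -> Set[str]:
    """Keep only genes with at least one direct edge to another gene in the set."""
    gene_set = set(genes)
    kept: Set[str] = set()
    for u, v in edges:
        if u != v and u in gene_set and v in gene_set:
            kept.update((u, v))
    return kept


# Toy PPI network: seeds A and B are linked only through intermediate X.
ppi = [("A", "X"), ("X", "B"), ("C", "Y"), ("D", "E")]
seeds = {"A", "B", "C"}

expanded = expand_one_step(seeds, ppi)
print(sorted(expanded))
print(sorted(trim_to_connected(seeds, ppi)))
print(sorted(trim_to_connected(expanded, ppi)))
```

In the toy network none of the seeds are directly connected, so trimming the seeds alone leaves nothing, while the expanded set links A and B through the intermediate X.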
Utility and discussion
According to our survey of the datasets in gsGator, the fraction of human genes with any phenotypic annotation is only 40.9%, most of which are genetic diseases (Table 2). Because a single gene is frequently associated with many phenotypes, this coverage figure is likely to be an overestimate in practice. Gene annotation from model organisms is a rich source for inferring the function of human genes. By taking advantage of the phenotype and protein-protein interaction networks of orthologous genes from model organisms, the coverage for human genes increases by 5.4% and 13.3%, respectively (Figure 2). This gain in phenotypic information is likely to keep growing for some time, because phenotypic characterization is likely to proceed much faster for model organisms than for human. Similarly, 12% of additional coverage is gained for protein-protein interaction (PPI) network. Although gene function and network structure may have diverged significantly between human and model organisms, functional gene modules are often unexpectedly well conserved by deep homology.
To test the utility of gsGator, we focus on mouse phenotypic annotation, because the paucity of its human counterpart makes the benefit of interactive and cross-species GSA more evident. Here, we took three example cases in which GSA was performed using GWAS hit genes as input. Typically, GWAS identifies genetic variations associated with certain phenotypes, resulting in a small number of associated genes as hits. The paucity of human phenotypic annotation makes it difficult to cross-confirm GWAS results. Frequently, GWAS hit genes do not belong to the same pathway or are not directly connected to each other in molecular networks, making it hard to gain mechanistic insight. The number of GWAS hit genes is often too small to reach statistical significance by conventional GSA. In the three case examples below, the results of conventional GSA for human phenotypic annotation are compared with those of GSA using 1) cross-species GSA, 2) network expansion of input genes and 3) network expansion + cross-species GSA. The procedure for preparing the input genes is illustrated in Figure 3. The details of the analytic procedures are also available in Additional file 2, including all the input gene sets and step-by-step screenshots of the gsGator web interface.
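The notion of coverage gain through orthologs can be made concrete with a small calculation. The Python sketch below counts a human gene as covered if it is annotated directly or if at least one of its orthologs is annotated; this counting rule, the placeholder gene IDs and the function names are assumptions for illustration, and the toy numbers are unrelated to the percentages reported here.

```python
from typing import Dict, Iterable, Set


def coverage(all_genes: Iterable[str], annotated_genes: Iterable[str]) -> float:
    """Fraction of all_genes that carry at least one annotation."""
    genes = set(all_genes)
    return len(genes & set(annotated_genes)) / len(genes)


def coverage_with_orthologs(
    all_genes: Iterable[str],
    annotated_genes: Iterable[str],
    orthologs: Dict[str, Set[str]],
    annotated_orthologs: Iterable[str],
) -> float:
    """Coverage when annotation is also transferred from annotated orthologs."""
    genes = set(all_genes)
    annotated_orth = set(annotated_orthologs)
    transferred = {g for g in genes if orthologs.get(g, set()) & annotated_orth}
    return coverage(genes, set(annotated_genes) | transferred)


# Toy data: 10 human genes, 4 annotated directly, 1 more through a mouse ortholog.
human = [f"h{i}" for i in range(1, 11)]
direct = {"h1", "h2", "h3", "h4"}
orth = {"h1": {"m1"}, "h5": {"m5"}, "h6": {"m6"}}
mouse_annotated = {"m1", "m5"}

before = coverage(human, direct)
after = coverage_with_orthologs(human, direct, orth, mouse_annotated)
print(before, after, after - before)
```

Genes that are already annotated directly (h1 in the toy data) do not add to the gain even when their orthologs are annotated as well.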
Case example 1: Cross-species GSA
In the first case, five genes from a genome-wide association (GWA) study for eye color were used as an input gene list. Conventional GSA resulted in no significant hits for human phenotypic annotation [22, 23]. However, a simple cross-species GSA identified many related annotations using mouse phenotypic annotations, such as decreased eye pigmentation (q-value = 2.5e-6) and diluted coat color (q-value = 2.8e-5) (Table 3).
Case example 2: Network expansion
The second case takes a slightly more elaborate approach, in which the results of two GWA studies for adiposity [24, 25] were combined by taking the union (seven genes) of the two hit gene lists (three and four genes, respectively). Neither conventional nor simple cross-species GSA resulted in any significant GSA hits for phenotypic annotation. In the network navigator, we observed that none of the seven input genes were connected to each other in the PPI network. By expanding each input gene via the PPI network, a larger input set of 104 genes was created, in which four of the initial seven genes were indirectly connected through expanded network neighbors acting as intermediates. GSA using the network-expanded 104 genes as input provided richer phenotypic interpretations, with many GSA hits on energy metabolism including insulin resistance (q-value = 5.0e-3), body fat (q-value = 9.6e-3), and glucose tolerance (q-value = 4.3e-2) (Table 4). This example demonstrates that network-expanded GSA allows more sensitive and extensive interpretation of gene lists with improved statistical significance.
Case example 3: Network expansion + Cross-species GSA
Finally, the third case shows GSA with a combination of network expansion and cross-species GSA. As input, seven GWAS hit genes for venous thrombosis (VT) were used. There was no significant GSA hit for phenotypic annotation using the simple cross-species GSA approach as in case 1 (3 input genes in mouse, VT_mouse). Network expansion (43 input genes in human, Net_VT) did result in some significant hits, including pregnancy loss (q-value = 4.9e-9) and brain hemorrhage (q-value = 1.5e-4). However, only 2 to 4 genes were shared between the input and the target gene sets owing to the scarcity of human phenotypic annotation, making these GSA results less convincing (Table 5). Next, we created a network-expanded and orthology-mapped set of 34 mouse genes (Net_VT_Mouse) by combining the features of both set creator and network navigator. It resulted in even richer phenotypic interpretations, with more than four times as many GSA hits (45 annotations for Net_VT_Mouse shown in Table 6) as the human network-expanded set (11 annotations for Net_VT shown in Table 5) at the same cut-off (q-value < 0.05). These GSA hits include many vasculature-related phenotypes such as abnormal blood coagulation (q-value = 7.5e-17), increased bleeding time (q-value = 7.0e-6), and gastrointestinal hemorrhage (q-value = 3.4e-5). This demonstrates that cross-species and network-expanded GSA allows even more sensitive and extensive interpretation of gene lists with improved statistical significance.
gsGator is a fully integrated, web-based tool for gene set analysis (GSA) that allows highly flexible and interactive analyses. A series of new gene sets can be created as a combination of any existing gene sets, which is highly desirable in most exploratory and discovery-oriented studies. According to our survey, cross-species GSA expands the coverage of human genes by 5.4% for phenotypic annotation and by 13.3% for the PPI network. Few existing tools are equipped with these functionalities in a single unified database, which reduces the burden of consulting multiple web sites and bioinformatics tools. All the gene lists and analytic results can be exported for further processing and integration with other analytic results. As demonstrated in the three case examples, interactive and cross-species GSA greatly extends the scope and utility of GSA, leading to novel insights via conserved functional gene modules across different species.
Availability and requirements
gsGator is freely available at http://gsGator.ewha.ac.kr. The gsGator user interface is implemented using Adobe FLEX 5.0, and the internal system is operated using JAVA and the Apache web server. MySQL 5.5 is used for database management. Flash player and JAVA need to be installed to use gsGator. Currently, gsGator is optimized and tested for the Chrome and Firefox web browsers in both Linux and Windows environments. Some features may be limited in other web browsers, e.g. Internet Explorer.
da Huang W, Sherman BT, Tan Q, Kir J, Liu D, Bryant D, Guo Y, Stephens R, Baseler MW, Lane HC, et al: DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res. 2007, 35 (Web Server issue): W169-W175.
Tabas-Madrid D, Nogales-Cadenas R, Pascual-Montano A: GeneCodis3: a non-redundant and modular enrichment analysis tool for functional genomics. Nucleic Acids Res. 2012, 40 (Web Server issue): W478-W483.
Duncan D, Prodduturi N, Zhang B: WebGestalt2: an updated and expanded version of the Web-based Gene Set Analysis Toolkit. BMC Bioinforma. 2010, 11 (Suppl 4): 10. 10.1186/1471-2105-11-S4-P10.
Reimand J, Arak T, Vilo J: g:Profiler--a web server for functional interpretation of gene lists (2011 update). Nucleic Acids Res. 2011, 39 (Web Server issue): W307-W315.
Rho K, Kim B, Jang Y, Lee S, Bae T, Seo J, Seo C, Lee J, Kang H, Yu U, et al: GARNET--gene set analysis with exploration of annotation relations. BMC Bioinforma. 2011, 12 (1): 25. 10.1186/1471-2105-12-25.
Kaimal V, Bardes EE, Tabar SC, Jegga AG, Aronow BJ: ToppCluster: a multiple gene list feature analyzer for comparative enrichment clustering and network-based dissection of biological systems. Nucleic Acids Res. 2010, 38 (Web Server issue): W96-W102.
Zhu Y, King BL, Parvizi B, Brunk BP, Stoeckert CJ, Quackenbush J, Richardson J, Bult CJ: Integrating computationally assembled mouse transcript sequences with the Mouse Genome Informatics (MGI) database. Genome Biol. 2003, 4 (2): R16. 10.1186/gb-2003-4-2-r16.
McQuilton P, St Pierre SE, Thurmond J: FlyBase 101--the basics of navigating FlyBase. Nucleic Acids Res. 2012, 40 (Database issue): D706-D714.
Yook K, Harris TW, Bieri T, Cabunoc A, Chan J, Chen WJ, Davis P, de la Cruz N, Duong A, Fang R, et al: WormBase 2012: more genomes, more data, new website. Nucleic Acids Res. 2012, 40 (Database issue): D735-D741.
Lopez D, Casero D, Cokus SJ, Merchant SS, Pellegrini M: Algal Functional Annotation Tool: a web-based analysis suite to functionally interpret large gene lists using integrated annotation and expression data. BMC Bioinforma. 2011, 12 (1): 282. 10.1186/1471-2105-12-282.
Fontanillo C, Nogales-Cadenas R, Pascual-Montano A, De las Rivas J: Functional analysis beyond enrichment: non-redundant reciprocal linkage of genes and biological terms. PloS One. 2011, 6 (9): e24289. 10.1371/journal.pone.0024289.
Xie C, Mao X, Huang J, Ding Y, Wu J, Dong S, Kong L, Gao G, Li CY, Wei L: KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic Acids Res. 2011, 39 (Web Server issue): W316-W322.
Chi SM, Kim J, Kim SY, Nam D: ADGO 2.0: interpreting microarray data and list of genes using composite annotations. Nucleic Acids Res. 2011, 39 (Web Server issue): W302-W306.
Paszkowski-Rogacz M, Slabicki M, Pisabarro MT, Buchholz F: PhenoFam-gene set enrichment analysis through protein structural information. BMC Bioinforma. 2010, 11: 254. 10.1186/1471-2105-11-254.
Du Z, Zhou X, Ling Y, Zhang Z, Su Z: agriGO: a GO analysis toolkit for the agricultural community. Nucleic Acids Res. 2010, 38 (Web Server issue): W64-W70.
Lachmann A, Ma'ayan A: Lists2Networks: integrated analysis of gene/protein lists. BMC Bioinforma. 2010, 11: 87. 10.1186/1471-2105-11-87.
Berriz GF, Beaver JE, Cenik C, Tasan M, Roth FP: Next generation software for functional trend analysis. Bioinformatics. 2009, 25 (22): 3043-3044. 10.1093/bioinformatics/btp498.
Baker EJ, Jay JJ, Bubier JA, Langston MA, Chesler EJ: GeneWeaver: a web-based system for integrative functional genomics. Nucleic Acids Res. 2012, 40 (Database issue): D1067-D1076.
Ostlund G, Schmitt T, Forslund K, Kostler T, Messina DN, Roopra S, Frings O, Sonnhammer EL: InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 2010, 38 (Database issue): D196-D203.
McGary KL, Park TJ, Woods JO, Cha HJ, Wallingford JB, Marcotte EM: Systematic discovery of nonobvious human disease models through orthologous phenotypes. Proc Natl Acad Sci USA. 2010, 107 (14): 6544-6549. 10.1073/pnas.0910200107.
Eriksson N, Macpherson JM, Tung JY, Hon LS, Naughton B, Saxonov S, Avey L, Wojcicki A, Pe'er I, Mountain J: Web-based, participant-driven studies yield novel genetic associations for common traits. PLoS Genet. 2010, 6 (6): e1000993. 10.1371/journal.pgen.1000993.
Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005, 33 (Database issue): D514-D517.
Becker KG, Barnes KC, Bright TJ, Wang SA: The genetic association database. Nature genetics. 2004, 36 (5): 431-432. 10.1038/ng0504-431.
Kilpelainen TO, Zillikens MC, Stancakova A, Finucane FM, Ried JS, Langenberg C, Zhang W, Beckmann JS, Luan J, Vandenput L, et al: Genetic variation near IRS1 associates with reduced adiposity and an impaired metabolic profile. Nature genetics. 2011, 43 (8): 753-760. 10.1038/ng.866.
Lindgren CM, Heid IM, Randall JC, Lamina C, Steinthorsdottir V, Qi L, Speliotes EK, Thorleifsson G, Willer CJ, Herrera BM, et al: Genome-wide association scan meta-analysis identifies three Loci influencing adiposity and fat distribution. PLoS Genet. 2009, 5 (6): e1000508. 10.1371/journal.pgen.1000508.
Germain M, Saut N, Greliche N, Dina C, Lambert JC, Perret C, Cohen W, Oudot-Mellakh T, Antoni G, Alessi MC, et al: Genetics of venous thrombosis: insights from a new genome wide association study. PloS One. 2011, 6 (9): e25581. 10.1371/journal.pone.0025581.
Funding: National Research Foundation of Korea (NRF) grants funded by MEST of KOREA (2011–0014992, 2013M3A9B6046519, 2012M3A9C5048707, 2012M3A9D1054744); GIST Systems Biology Infrastructure Establishment Grant (2012–3) through Ewha Research Center for Systems Biology (ERCSB); Ewha Womans University Research Grant of 2013.
The authors declare that they have no competing interests.
HK, SL and WK participated in the design of the study, conducted computational and statistical analysis and wrote the manuscript. SC and DR participated in collecting data sets and building databases. IC implemented the web application and computational analysis. All authors read and approved the final manuscript.