Var2GO: a web-based tool for gene variants selection
BMC Bioinformatics volume 17, Article number: 376 (2016)
One of the most challenging issue in the variant calling process is handling the resulting data, and filtering the genes retaining only the ones strictly related to the topic of interest. Several tools permit to gather annotations at different levels of complexity for the detected genes and to group them according to the pathways and/or processes they belong to. However, it might be a time consuming and frustrating task. This is partly due to the size of the file, that might contain many thousands of genes, and to the search of associated variants that requires a gene-by-gene investigation and annotation approach. As a consequence, the initial gene list is often reduced exploiting the knowledge of variants effect, novelty and genotype, with the potential risk of losing meaningful pieces of information.
Here we present Var2GO, a new web-based tool to support the annotation and filtering of variants and genes coming from variant calling of high-throughput sequencing data. Var2GO permits to upload either the unprocessed Variant Calling Format file or a table containing the annotated variants. The raw data undergo a preliminary step of variants annotation, using the SnpEff tool, and are converted to a table format. The table is then uploaded into an on the fly generated database. Genes associated to the variants are automatically annotated with the corresponding Gene Ontology terms covering the three GO domains. Using the web interface it is then possible to filter and extract, from the whole list, genes having annotations in the domain of interest, by simply specifying filtering parameters and one or more keywords. The relevance of this tool is demonstrated on exome sequencing data.
Var2GO is a novel tool that implements a topic-based approach, expressly designed to help biologists in narrowing the search of relevant genes coming from variant calling analysis. Its main purpose is to support non-bioinformaticians in handling and processing raw variant calling data through an intuitive web interface. Furthermore, Var2GO offers a complete pipeline that, starting from the raw VCF file, allows to annotate both variants and associated genes and supports the extraction of relevant biological knowledge.
Much progress has been made in Next Generation Sequencing (NGS) data analysis. More and more geneticists are introducing this technology in their routine work to exploit a wide, cost-effective approach in identifying and studying genetic variants across the genome. Whole-genome and whole-exome sequencing are widely used to investigate these variants as cause of rare and complex diseases. Handling genome or exome data is a powerful opportunity but not poor in challenges, due to technical difficulties in managing such big amounts of data. Transforming raw sequence reads into meaningful information might be a tough task, especially for small labs. Although exome sequencing reports much less variants compared to genome sequencing , the output coming from the entire pipeline of variant calling is a huge VCF (Variant Calling Format) file containing hundreds of thousand rows. The VCF file contains both single-nucleotide polymorphisms (SNPs) and small insertions and deletions (INDELs), annotated using several variants identification tools, such as ANNOVAR , SnpEff , AnnTools . These software provide several information such as the affected genes, the effects of the variants at the level of the protein products, or the minor allele frequency. However, they do not support any further investigation aimed to identify genes significant to the topic of study. To give a biological meaning to the obtained data, the scientist needs to filter the relevant information from hundreds of thousands of SNPs and INDELs called, since only a subgroup of variants is pertinent to the given domain of interest [5, 6].
A large amount of scientific literature proves that the Gene Ontology (GO), from the Gene Ontology Project , is an invaluable source of the most sophisticated meanings of biological systems available to scientists . GO annotations provide a bridge between the gene product and its function, its cellular localization and the biological processes in which it is involved. This knowledge is organized into three different GO ontologies: molecular function, cellular component and biological process, respectively. We focused on the GO annotation as a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any organism .
Many tools have been developed to annotate lists of genes in terms of the biological role of their products, and to filter the huge VCF variants lists. However, they often require advanced computational skills for both the installation process and their usage [10, 11]. When a web service is available, as in , usually only a list of genes is given in output, thus requiring further processing to match the initial data with the list of new annotations. This is an essential step to retain all the information collected during the pre-processing steps (quality parameters, position, variant ID, sample genotype and others), since they might be used to better investigate among variant features and obtain more reliable results.
As a consequence, the initial variant list is often greatly reduced by applying simple filtering criteria such as variants effect prediction, novelty and genotype. While this approach might help in reducing the size of the data, it greatly increase the risk of losing meaningful pieces of information.
Examples of available tools for gene annotation include: GoGene , which is conceived to allow gene annotation through GO and MeSH (Medical Subject Headings) vocabularies. It requires a list containing gene names/identifiers as input. The output list with annotation is easily viewable by the web interface, but, although the links to export the list in different file formats are provided, they do not seem to have any content. Furthermore, it does not work with input list containing more than 1000 genes; AnnoKey , which is usable exclusively from command line and it has a gene number limit in online mode; SNP2GO , which is an R package, hence requiring the user to have programming skills. Moreover, it is not available for all the recent R versions (tested under R v3.2.0 and v3.2.1).
Other software are conceived to allow the direct interaction with the variant table to apply filtering strategies: GEMINI , KGGSEQ , Plink/SEQ , which are more focused on variant rather than gene annotation, and all exclusively usable from command line. One available tool for visualizing annotated sequence variants and interacting with the VCF input file is SVA , which requires high-performance workstations or servers to properly work. Other available software include VCF-Miner  and EVA , but both of them need the installation of a complete web stack as a prerequisite. The most user-friendly tool is VarSifter , a Java-based software which requires a minimal installation. However, it can be mainly used as a filtering tool, since it does not offer further annotation features.
To support the researchers in the process of annotating and filtering all the variants obtained from sequencing experiments, we have developed a web-based tool named Var2Go. To the best of our knowledge, Var2GO is the first web tool which permits to upload a complete raw variants file, to annotate both the variants and the related genes, and filter them in an interactive way, obtaining a reduced file with all the needed information. Var2Go accepts as input either a raw VCF file or a tab-delimited annotated table with a maximum size of 500 MB. In the first case, the user must select the reference genome for the variant annotation (at the moment the available genome assemblies are hg19 and hg38 for Homo Sapiens, mm9 and mm10 for Mus musculus). The VCF file is annotated using the SnpEff 4.2  which produces a new VCF file in a format compliant with the latest annotation standard. After, the resulting file is converted in table format using SnpSift . SnpEff annotates more than one transcript for each gene variant. We decided to keep and show the most relevant one, based on the selection strategy described in the software documentation available at the SnpEff website . Instead, when uploading a file in table format, the user must select the species of interest and the fields containing the essential information (i.e. Gene symbol, Chromosome, Position, Reference and Alternative allele). Var2GO extracts the annotations for each gene among the three GO ontologies, and creates on the fly a relational database in which all the fields of the table and the relative content are stored. Through the web interface, it is possible to specify a list of key terms (whole words or portions) considered by the user as pertinent to the specific topic of study. Furthermore, Var2GO permits to dynamically interact with the input table applying filters to the information contained into the columns. Lastly, it is possible to select columns to be included in the resulting output file.
It is also possible to choose whether to convert the GT (Genotype) characters into HOM/HET/HOMREF nomenclature by a pre-processing step. After the database creation, the user can visualize the loaded data and specify the keywords and filters to use, using dropdown lists. The query is built using the specified filtering criteria on the columns content and keywords connected by selecting boolean operators.
After the annotation and filtering processes are complete, the user can download the resulting table in a tab-delimited format. If the variants annotation step is requested, two gzip files, one for the complete table and one for the relative transcripts info, are available for downloading. Furthermore, it is possible to download the entire database created by Var2Go. This option has been thought for users who have more advanced computational skills and can store the information ready to be filtered and parsed.
Results and discussion
Var2GO has a simple and user-friendly interface (see Fig. 1). The users is guided through the few steps needed to upload and create the database. After, the choice of the filters and key terms can be performed using a very intuitive approach. Results can be downloaded either as a simple tab-separated file or as a complete database, for more experienced users. The usefulness of Var2GO is tested on a data set coming from an exome sequencing experiment.
Case study: Glycogen storage disease exome sequencing
We used an exome sequencing dataset coming from a variant discovery study on a family with polyglucosan body myopathy . Polyglucosan body myopathies (PGBMs) are a group of glycogen storage diseases (GSDs) which affects almost exclusively the striated muscle . The glycogen storage disease are rare inherited metabolic disorders caused by malfunction of enzymes involved in synthesis or degradation of glycogen . There are over 12 different types classified accordingly to the involved enzyme and affected tissue. Since exogenous glucose is metabolized and stored in the liver and skeletal muscle, these are the two most involved tissues . The dataset we used was composed of samples coming from three affected relatives (2 siblings, 1 cousin) who presented a late-onset PGBM. The VCF file was obtained through a variant calling pipeline performed by the authors. SnpEff  and ANNOVAR  tools were used to annotate each called variant in terms of amino acid change, gene name, functional class, impact, frequency in NHLBI Exome Variant Server (ESP6500)  and 1000 Genomes Project [28, 29], and causative effects by the SIFT prediction algorithm . The annotated VCF file was then converted to a tab-delimited format by VariantsToTable, a walker of the Genome Analysis ToolKit (GATK) . The table, composed by 57 columns reporting variant information obtained from both calling and annotation steps, and 160501 variant rows, was uploaded to Var2GO. We filtered the variants by several annotations, such as ESP 6500 allele frequency, VQSLOD filter (derived by variant quality score recalibration walker of GATK to assign a well-calibrated probability to each variant call), SnpEff functional class, genotype and depth of coverage. In detail, we asked Var2GO to retain not silent variants which passed the VQSLOD filter, had a ESP 6500 minor allele frequency smaller than 0.02, a depth of coverage per sample greater than 20, and were in common to the three samples. Applying these filtering criteria we obtained a table containing 8978 genes, less than 50 % of the initial list (19453 genes). In order to narrow the search to the genes which had functionality related to the topic of interest, we applied a filter to the GO annotation field by typing the following keywords: ‘gluc’ (for ‘glucose’, ‘glucan’, ‘glucotransferase’, etc.), and ‘glyco’ (for ‘glycogen’, ‘glycolitic’, ‘glicosylation’, etc.). The resulting table contained 949 genes, each with more than one variant called and multiple GO annotations. Then, we applied a filter also to the field of the 1000 Genome allele frequency (<0.02) and we obtained 352 genes. At this point, the reduced table was ready to be downloaded and easily handled and interpreted. The reduced size of the resulting gene list allowed to retrieve additional information from pathway and/or gene-disease association studies, using, for example, the Database for Annotation, Visualization and Integrated Discovery (DAVID) . The submission of our list to this resource and the selection of KEGG , BioCarta , Reactome , OMIM  annotation categories, resulted in 44 chart records ranked in according to the p-value. The most significant term was ‘REACT_474:Metabolism of carbohydrates’ from Reactome, enriched by 20 genes and with a fold enrichment score of 6.0248. At this point, we better investigated these 20 genes and we found that 4 of them are associated to glycogen storage disease as reported by the Human Phenotype Ontology project (HPO) [37, 38], namely LDHA, PHKG2, ENO3 and GYG1. Hence, the use of Var2GO allowed to easily find the GYG1 variant, confirming the result obtained by Fanin et al. . Handling a reduced list of genes permits to better exploit the available knowledge and the online resources, thus to deeply investigate the variant genes without losing precious information. Moreover, besides known disease genes, Var2Go allows to explore variants in genes which may be involved in the pathophysiological process.
We developed Var2Go to support researchers in handling and interpreting gene variant data obtained from high-throughput sequencing experiments. Thanks to an user-friendly and web-based interface, this tool allows to annotate, filter and search large variant files with a small effort from the user and no programming skills or computational resource required, except a modern browser. Few easy steps are needed to reduce the initial table to variants having the desired quality and features, and to genes having ontology terms related to the domain of interest. We are currently working to add the HPO annotation for human genes, and more genomic databases to allow the variants annotation of different species. Further, we are planning to provide Var2Go as a cloud app in order to improve performance and allow the processing of larger files.
Availability and requirements
Next generation sequencing
Variant calling format
Single nucleotide polymorphism(s)
insertion(s) and deletion(s)
Data base management system
PolyGlucosan body myopathies
glycogen storage diseases
Lactate dehydrogenase A
PHosphorylase kinase gamma 2
Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z. A survey of tools for variant analysis of next-generation genome sequencing data. Briefings in bioinforma. 2014; 15(2):256–78.
Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010; 38(16):164–4.
Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: Snps in the genome of drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012; 6(2):80–92.
Makarov V, O’Grady T, Cai G, Lihm J, Buxbaum JD, Yoon S. AnnTools: a comprehensive and versatile annotation toolkit for genomic variants. Bioinformatics. 2012; 28(5):724–5.
Do R, Kathiresan S, Abecasis GR. Exome sequencing and complex disease: practical aspects of rare variant association studies. Human molecular genetics. 2012; 21(R1):1–9.
Snape K, Ruark E, Tarpey P, Renwick A, Turnbull C, Seal S, Murray A, Hanks S, Douglas J, Stratton MR, et al. Predisposition gene identification in common cancers by exome sequencing: insights from familial breast cancer. Breast cancer Res Treat. 2012; 134(1):429–33.
Consortium GO, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research. 2004; 32(suppl 1):258–61.
Hill DP, Smith B, McAndrews-Hill MS, Blake JA. Gene Ontology annotations: what they mean and where they come from. BMC Bioinf. 2008; 9(Suppl 5):2.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene Ontology: tool for the unification of biology. Nature genetics. 2000; 25(1):25–9.
Hart SN, Duffy P, Quest DJ, Hossain A, Meiners MA, Kocher J-P. VCF-Miner: GUI-based application for mining variants and annotations stored in VCF files. Briefings in bioinforma. 2016; 17(2):346–51.
Coutant S, Cabot C, Lefebvre A, Léonard M, Prieur-Gaston E, Campion D, Lecroq T, Dauchel H. EVA: Exome Variation Analyzer, an efficient and versatile tool for filtering strategies in medical genomics. BMC Bioinf. 2012; 13(Suppl 14):9.
Plake C, Royer L, Winnenburg R, Hakenberg J, Schroeder M. GoGene: gene annotation in the fast lane. Nucleic Acids Research. 2009; 37(suppl 2):300–4.
Park DJ, Nguyen-Dumont T, Kang S, Verspoor K, Pope BJ. Annokey: an annotation tool based on key term search of the ncbi entrez gene database. Source Code Biol Med. 2014; 9(1):15.
Szkiba D, Kapun M, von Haeseler A, Gallach M. SNP2GO: Functional analysis of genome-wide association studies. Genetics. 2014; 197(1):285–9.
Paila U, Chapman BA, Kirchner R, Quinlan AR. GEMINI: integrative exploration of genetic variation and genome annotations. PLoS Comput Biol. 2013; 9(7):1003153.
Li M-XX, Gui H-SS, Kwan JS, Bao S-YY, Sham PC. A comprehensive framework for prioritizing variants in exome sequencing studies of mendelian diseases. Nucleic Acids Res. 2012; 40:e53.
Plink/SEQ Home Page. https://atgu.mgh.harvard.edu/plinkseq/index.shtml. Accessed 26 Oct 2015.
Ge D, Ruzzo EK, Shianna KV, He M, Pelak K, Heinzen EL, Need AC, Cirulli ET, Maia JM, Dickson SP, et al. SVA: software for annotating and visualizing sequenced human genomes. Bioinformatics. 2011; 27(14):1998–2000.
Teer JK, Green ED, Mullikin JC, Biesecker LG. VarSifter: visualizing and analyzing exome-scale sequence variation data on a desktop computer. Bioinforma. 2012; 28(4):599–600.
Cingolani P, Patel VM, Coon M, Nguyen T, Land SJ, Ruden DM, Lu X. Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program. SnpSift. 2012; 3:35.
SnpEFF 4.2 - Genetic Variant Annotation and Effect Prediction Toolbox. http://snpeff.sourceforge.net/. Accessed 28 Apr 2016.
PLUPLOAD 2.1.8 - A File Upload Plugin. http://www.plupload.com.
Fanin M, Torella A, Savarese M, Nigro V, Angelini C. GYG1 gene mutations in a family with polyglucosan body myopathy. Neurology Genet. 2015; 1(3):21.
Oldfors A, DiMauro S. New insights in the field of muscle glycogenoses. Curr Opin Neurol. 2013; 26(5):544–53.
Wolfsdorf JI, Weinstein DA. Glycogen storage diseases. Rev Endocr Metab Disord. 2003; 4(1):95–102.
Ozen H. Glycogen storage diseases: new perspectives. World J Gastroenterol WJG. 2007; 13(18):2541–553.
NHLBI Exome Variant Server. http://evs.gs.washington.edu/EVS.
1000 Genomes Project. http://www.1000genomes.org.
Consortium GP, et al. A map of human genome variation from population-scale sequencing. Nature. 2010; 467(7319):1061–73.
Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature protocols. 2009; 4(7):1073–81.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation dna sequencing data. Genome Res. 2010; 20(9):1297–303.
Dennis Jr G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA, et al. DAVID: database for annotation, visualization, and integrated discovery. Genome biol. 2003; 4(5):3.
Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000; 28(1):27–30.
Nishimura D. BioCarta. Biotech Software & Internet Report: The Computer Software Journal for Scient. 2001; 2(3):117–20.
Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath G, Wu G, Matthews L, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 2005; 33(suppl 1):428–32.
Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005; 33(suppl 1):514–7.
Robinson PN, Köhler S, Bauer S, Seelow D, Horn D, Mundlos S. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am J Human Genet. 2008; 83(5):610–5.
Köhler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, Black G, Brown DL, Brudno M, Campbell J, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014; 42(D1):966–74.
The authors thank Giulio Piluso and Annalaura Torella for kindly providing the VCF file used in the case study .
This article has been published as part of BMC Bioinformatics Vol 17 Suppl 12 2016: Italian Society of Bioinformatics (BITS): Annual Meeting 2015. The full contents of the supplement are available online at <http://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-12>.
This work has been funded by MIUR PON02-00619 and by Interomics Italian Flagship Project. Mario R. Guarracino work has been conducted at National Research University Higher School of Economics (HSE) and has been supported by the RSF grant n. 14-41-00039.
Availability of data and materials
The test files and a user guide are available on the tool website.
IG and MS conceived the project. IG proposed the case study and curated the biological interpretation. MS designed, developed and tested the Perl and PHP codes. FM designed the graphical interface, and developed the JQuery and PHP codes. MM helped in revising the existing software and tools. MRG supervised the whole project. IG and MS wrote the manuscript, which all authors read and approved.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
About this article
Cite this article
Granata, I., Sangiovanni, M., Maiorano, F. et al. Var2GO: a web-based tool for gene variants selection. BMC Bioinformatics 17 (Suppl 12), 376 (2016). https://doi.org/10.1186/s12859-016-1197-0