VariVis: a visualisation toolkit for variation databases
BMC Bioinformatics volume 9, Article number: 206 (2008)
With the completion of the Human Genome Project and recent advancements in mutation detection technologies, the volume of data available on genetic variations has risen considerably. These data are stored in online variation databases and provide important clues to the cause of diseases and potential side effects or resistance to drugs. However, the data presentation techniques employed by most of these databases make them difficult to use and understand.
Here we present a visualisation toolkit that can be employed by online variation databases to generate graphical models of gene sequences with their corresponding variations and consequences. The VariVis software package can run on any web server capable of executing Perl CGI scripts and can interface with numerous Database Management Systems and "flat-file" data files. VariVis produces two easily understandable graphical depictions of any gene sequence and matches these with variant data. While developed with the goal of improving the utility of human variation databases, the VariVis package can be used in any variation database to enhance utilisation of, and access to, critical information.
Although there has been effort over the last few years to improve the quality of variation databases, with the Human Genome Variation Society publishing guidelines covering variant nomenclature, database content, structure and deployment, and quality control, much of the data contained in variation databases remains difficult to access. In a 2002 survey of locus specific databases (LSDBs), Claustres et al. noted that only 54% of examined databases would fit minimal criteria for ease of use, only "some" depicted the distribution of variation within a gene and "few" possessed graphical displays, especially of a dynamic nature. Huge sums of money have been invested in the search for the underlying genetic causes of disease, but much of that investment is wasted if those who need the existing data can neither access nor understand what is being presented. Variation databases are heavily underfunded and usually run 'on the side' by researchers while pursuing their funded research interests. This lack of time and money means that database curators are often unable to devote the necessary effort to developing useful visualisation tools. However, the data that languish in these databases because they are so poorly presented could, if utilised effectively, help to explain the causes of many human diseases and assist clinicians in diagnosing them.
To encourage greater understanding and facilitate the interpretation of data presented in the numerous locus specific databases available on the internet, we have developed software that will give the curators of variation databases quick and easy access to some basic visualisation tools.
In one sense, such visualisation tools are commonplace. The UCSC Genome Browser is an excellent tool for viewing sequence annotations, providing both graphical and text-based views. Users can view various annotation sets, including cDNA evidence, predicted genes and variation data. Genewindow, developed by the National Cancer Institute, provides a gene-centric view of the human genome specifically designed for variation visualisation.
However, these tools are of limited use to LSDB curators because they do not allow curators to easily use their own data sets, which are often more complete than the published literature, although the PhenCode project has gone some way to mitigate the difficulty of loading an LSDB into the UCSC Browser. Moreover, directing users away from the website of a locus specific database in order to use these browsers is undesirable, and negates the benefits of the gene-specific information that the majority of LSDBs offer above and beyond variation data: 52% of locus specific databases contain additional information on the diseases caused by the catalogued variants, and 34% provide clinical information, aimed at clinicians (91%) and at patients and their families (66%). While incorporating variation data into genome wide browsers is desirable, it does not replace the need for data visualisation tools that can be incorporated into an LSDB's own web site.
On the other hand, specialist variation database software packages that include visualisation, such as UMD and MUTbase, require curators to use a specific database schema and user interface, which may not always be practical or sufficient.
VariVis is a collection of Perl scripts designed to provide a basic set of visualisation tools specifically for LSDBs. It works in parallel with the existing user interface and database of an LSDB and can access variation data stored in a wide variety of formats, from Database Management Systems (DBMSs) such as MySQL, Oracle and PostgreSQL through to flat-file repositories such as comma or tab delimited text files. LSDBs utilising the specialised variation database software package LOVD are especially suited to incorporating VariVis into their repertoire of tools. Gene sequences and annotations can be accessed from a locally stored file in any of a large number of sequence file formats, including the FASTA, BSML and GenBank formats (see Table 1); alternatively, VariVis can be directed to automatically retrieve sequences from any of several online sequence databases, such as GenBank or EMBL.
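To make the flat-file case concrete, the following is a minimal sketch of reading variant records from a tab delimited text file. It is written in Python purely for illustration (VariVis itself is implemented in Perl), and the function name `load_variants`, the column names and the sample records are all assumptions rather than the actual VariVis file layout.

```python
import csv
import io


def load_variants(handle, delimiter="\t"):
    """Read variant records from a delimited text file that has a header row.

    Each record is returned as a dict keyed by the column names in the header.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [dict(row) for row in reader]


# Hypothetical flat file: the column names and records are placeholders,
# not the layout used by VariVis or by any particular LSDB.
SAMPLE = (
    "variant\tnote\n"
    "c.76A>T\texample record 1\n"
    "c.112_117del\texample record 2\n"
)

variants = load_variants(io.StringIO(SAMPLE))
for record in variants:
    print(record["variant"], "-", record["note"])
```

Passing `delimiter=","` would handle a comma delimited file in the same way; a DBMS-backed source would instead return rows from a query, but could present the same list-of-records shape to the rest of the program.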
Upon each execution of the program, VariVis uses the external BioPerl modules to access the sequence data and stores the nucleotide sequence and any available structural annotations. The nucleotide sequence is then automatically numbered according to the HGVS variation numbering guidelines. The Perl DBI module is then used to retrieve sequence variants from the database. As each variant is added, a set of hand-written regular expressions is applied to the variant name to determine the variant type and location. These regular expressions can recognise substitution, deletion, insertion, duplication, insertion-deletion (indel) and inversion events. Support for more complex rearrangements is planned for a later release.
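The regular-expression step can be illustrated with a short sketch. The patterns below are not the expressions used by VariVis (which is written in Perl and whose patterns are not reproduced in this article); they are a simplified Python approximation that assumes coding-DNA names written in the common `c.` style, such as `c.76A>T` or `c.112_117del`. The names `POS`, `PATTERNS` and `classify` are hypothetical.

```python
import re

# A position: optional 5' ("-") or 3' ("*") prefix, a number, and an
# optional intronic offset such as "+2" or "-5".
POS = r"[-*]?\d+(?:[+-]\d+)?"


def _ranged(suffix):
    """Build a pattern for 'c.<start>[_<end>]<suffix>'."""
    return re.compile(rf"^c\.(?P<start>{POS})(?:_(?P<end>{POS}))?{suffix}$")


# One pattern per event type named in the text (illustrative, not exhaustive).
PATTERNS = [
    ("substitution", re.compile(rf"^c\.(?P<start>{POS})[ACGT]>[ACGT]$")),
    ("indel", _ranged(r"delins[ACGT]+")),
    ("deletion", _ranged(r"del[ACGT]*")),
    ("duplication", _ranged(r"dup[ACGT]*")),
    ("insertion", _ranged(r"ins[ACGT]+")),
    ("inversion", _ranged(r"inv")),
]


def classify(name):
    """Return (type, start, end) for a recognised variant name, else None."""
    for kind, pattern in PATTERNS:
        match = pattern.match(name)
        if match:
            start = match.group("start")
            # Single-position events have no explicit end; reuse the start.
            end = match.groupdict().get("end") or start
            return kind, start, end
    return None


for example in ["c.76A>T", "c.112_117del", "c.112_117delinsTG", "c.88+2T>G"]:
    print(example, "->", classify(example))
```

Names that match none of the patterns return `None`, which gives a natural place to flag the more complex rearrangements that are not yet handled.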
The graphics are rendered by a dedicated module using basic HTML output to the user's browser.
VariVis is capable of producing two different types of graphical representations of the sequence and variation data provided. Both views initially display an overview of the gene structure, divided into introns and exons. Clicking on any of these divisions will display the selected intron or exon in further detail.
The standard view (Figure 1) is reminiscent of traditional sequence depictions, with the sequence broken up into discrete horizontal chunks that are stacked down the page. Variant nucleotides are superimposed in red above the gene sequence, and their corresponding effects on the amino acid sequence are shown in red below the amino acid sequence.
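The layout of the standard view can be approximated in plain text. The sketch below, in Python rather than the Perl and HTML that VariVis uses, breaks a sequence into fixed-width chunks and writes each variant base on the line above its position. It handles substitutions only and omits the amino acid track and colouring; the function name `standard_view`, the chunk width and the example data are assumptions made for illustration.

```python
def standard_view(sequence, substitutions, width=10):
    """Return text lines: each sequence chunk preceded by its variant overlay.

    substitutions maps a 1-based nucleotide position to the variant base.
    """
    lines = []
    for offset in range(0, len(sequence), width):
        chunk = sequence[offset:offset + width]
        # Overlay row: the variant base where one exists, otherwise a space.
        overlay = "".join(
            substitutions.get(offset + i + 1, " ") for i in range(len(chunk))
        )
        lines.append(overlay.rstrip())
        lines.append(chunk)
    return lines


# Hypothetical 12-base sequence with two substitutions.
view = standard_view("ATGGCCAAGTTT", {3: "A", 11: "C"}, width=6)
print("\n".join(view))
```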
The second view (Figure 2), the "Gel View," has the same functionality as the first viewing option, but this time orientates the sequence vertically, allowing for an unbroken stream of data. Theoretically, it is possible that any given nucleotide in a gene can be mutated to any other nucleotide base, deleted entirely, or have an adjacent insertion. Thus, the "Gel View" displays all possible nucleotide combinations for each position, a novel strategy, highlighting the nucleotides present in the reference sequence and any variations in contrasting colours. A horizontal version of this format is also available (not shown).
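The idea behind the "Gel View" can be sketched as a grid with one row per sequence position and one lane per possible state. The Python below is an illustrative approximation only (VariVis renders this view in HTML from Perl): it marks the reference base with `R`, any observed variant with `V` and everything else with `.`. The lane set, the single-character markers and the name `gel_view` are assumptions, and insertions are left out for brevity.

```python
# One lane per nucleotide plus a lane for deletion of the position.
LANES = ("A", "C", "G", "T", "del")


def gel_view(sequence, variants):
    """Return (position, cells) rows for a vertical, gel-like display.

    variants maps a 1-based position to the set of lane names observed there.
    """
    rows = []
    for pos, ref in enumerate(sequence, start=1):
        observed = variants.get(pos, set())
        cells = []
        for lane in LANES:
            if lane == ref:
                cells.append("R")   # base present in the reference sequence
            elif lane in observed:
                cells.append("V")   # variant recorded at this position
            else:
                cells.append(".")   # possible, but not observed
        rows.append((pos, "".join(cells)))
    return rows


# Hypothetical data: T>C at position 2 and a deletion of position 3.
grid = gel_view("ATG", {2: {"C"}, 3: {"del"}})
for pos, cells in grid:
    print(f"{pos:>4} {cells}")
```

Because every position occupies exactly one row, the sequence can run down the page without being broken into chunks, which is the property the vertical orientation is meant to provide.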
In both views, the software also displays any structural annotations, such as promoter sequences and UTRs, as bands of colour running in tracks alongside the gene sequence. The software also provides access to the raw sequence data, allowing users to copy or download the entire sequence, or specific chunks of it, without needing to navigate to a dedicated sequence database. Clicking a variant nucleotide provides the user with a brief overview of the clinically important data present in the database for the corresponding variant, from which the user can choose to view the original report of the variant via PubMed or perform a Google Scholar search for papers discussing the variant.
One of the major factors we took into consideration during development of the VariVis package was ease of use, not only for the database user, but also for the curator. Once the VariVis package is installed, it requires no further action on the part of the curator, allowing them to focus entirely on collecting new data. As the graphics are generated dynamically at the time of viewing, new variants are automatically included in the displays as soon as they are added to the database. VariVis can be used to provide an overall map of a gene of interest that users can explore at their leisure, or to illustrate specific variants or features of the gene by linking directly to specific areas within the gene.
The program is easily installed by copying the source files into any directory on a web server capable of executing CGI scripts, and then configured using the provided installation script.
The study of genetic variation has yielded, and will continue to yield, remarkable health benefits for all humans. While the data currently available in online variation databases are extremely valuable, the data presentation methods employed make accessing and understanding those data very difficult. As the number of variation databases expands and more variants are discovered, the need for better presentation methods will become more apparent.
It has been argued that the amount of data that will eventually be generated from a complete catalogue of all possible genetic variants will outstrip the data generated by the Human Genome Project. It is interesting to note, then, that while "fancy tools" for displaying sequence data exist in the form of genome browsers such as the UCSC browser, very few tools currently exist for displaying sequence variation. While the variation database management software packages UMD and MUTbase do contain some visualisation tools, these tools are only available to the users of these particular database software packages. Visualisation tools are not available to database curators who cannot, or have chosen not to, use these systems.
There is a lack of visualisation tools for variation data that can be implemented on any database system. The VariVis software package is an attempt to rectify this situation by providing database curators with a visualisation tool capable of easily combining the highly curated variation data within LSDBs with sequence and annotation data regardless of their underlying database and user interface. While developed with the study of human variation in mind, the VariVis software package could be implemented in databases devoted to the cataloguing of variation within any species, particularly viruses, parasites and plants.
Very soon, the Human Variome and the myriad of health consequences it engenders will be fully described and annotated [19, 20]. Novel systems need to be developed to allow fast and efficient access to, and use of, the enormous volumes of data and information that will be generated and made available, often to those less familiar with these databases. The system described here is a first step towards accomplishing efficient and useful access to these vital databases, and goes some way to satisfying recommendation D6 of the Human Variome Project.
Availability and Requirements
Project name: VariVis
Project home page: http://www.genomic.unimelb.edu.au/varivis
Operating system(s): Platform independent
Programming language: Perl
Other requirements: BioPerl and Perl DBI external modules (available via CPAN)
Any restrictions to use by non-academics: Commercial use license can be obtained by contacting the authors
den Dunnen JT, Paalman MH: Standardizing mutation nomenclature: why bother? Human Mutation 2003, 22: 181–182. 10.1002/humu.10262
Scriver CR, Nowacki PM, Lehväslaiho H: Guidelines and recommendations for content, structure, and deployment of mutation databases: II. Journey in progress. Human Mutation 2000, 15: 13–15. 10.1002/(SICI)1098-1004(200001)15:1<13::AID-HUMU5>3.0.CO;2-Y
Cotton RGH, Horaitis O: Quality control on the discovery, reporting and recording of genomic variation. Human Mutation 2000, 15: 16–21. 10.1002/(SICI)1098-1004(200001)15:1<16::AID-HUMU6>3.0.CO;2-S
Claustres M, Horaitis O, Vanevski M, Cotton RGH: Time for a unified system of mutation description and reporting: a review of locus-specific mutation databases. Genome Research 2002, 12: 680–688. 10.1101/gr.217702
Maurer SM, Firestone RB, Scriver CR: Science's neglected legacy. Nature 2000, 405: 117–120. 10.1038/35012169
Patrinos GP, Brookes AJ: DNA, disease and databases: disastrously deficient. Trends in Genetics 2005, 21(6):333–338. 10.1016/j.tig.2005.04.004
Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F, et al.: The UCSC genome browser database: update 2006. Nucleic Acids Research 2006, 34(Database issue):590–598. 10.1093/nar/gkj144
Staats B, Qi L, Beerman M, Sicotte H, Burdett LA, Packer B, Chanock SJ, Yeager M: Genewindow: an interactive tool for visualization of genomic variation. Nature Genetics 2005, 37(2):109–110. 10.1038/ng0205-109
Cotton RGH: Progress of the HUGO Mutation Database Initiative: a brief introduction to the Human Mutation MDI special issue. Human Mutation 2000, 15: 4–6. 10.1002/(SICI)1098-1004(200001)15:1<4::AID-HUMU3>3.0.CO;2-T
Giardine B, Riemer C, Hefferon T, Thomas D, Hsu F, Zielenski J, Sang Y, Elnitski L, Cutting G, Trumbower H, et al.: PhenCode: connecting ENCODE data with mutations and phenotype. Human Mutation 2007, 28(6):554–562. 10.1002/humu.20484
Béroud C, Hamroun D, Collod-Béroud G, Boileau C, Soussi T, Claustres M: UMD (Universal Mutation Database): 2005 Update. Human Mutation 2005, 26(3):184–191. 10.1002/humu.20210
Riikonen P, Vihinen M: MUTbase: maintenance and analysis of distributed mutation databases. Bioinformatics 1999, 15(10):852–859. 10.1093/bioinformatics/15.10.852
Fokkema IFAC, den Dunnen JT, Taschner PEM: LOVD: Easy creation of a locus-specific sequence variation database using an "LSDB-in-a-box" approach. Human Mutation 2005, 26(2):63–68. 10.1002/humu.20201
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Research 2006, 34(suppl 1):D16–20. 10.1093/nar/gkj157
Kanz C, Aldebert P, Althorpe N, Baker W, Baldwin A, Bates K, Browne P, Broek A, Castro M, Cochrane G, et al.: The EMBL Nucleotide Sequence Database. Nucleic Acids Research 2005, 33(suppl 1):D29–33.
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H, et al.: The Bioperl Toolkit: Perl modules for the Life Sciences. Genome Research 2002, 12: 1611–1618. 10.1101/gr.361602
den Dunnen JT, Antonarakis SE: Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Human Mutation 2000, 15: 7–12. 10.1002/(SICI)1098-1004(200001)15:1<7::AID-HUMU4>3.0.CO;2-N
Cotton RGH, participants of the 2006 Human Variome Project meeting: Recommendations of the 2006 Human Variome Project meeting. Nature Genetics 2007, 39(4):433–436. 10.1038/ng2024
Editorial: What is the Human Variome Project? Nature Genetics 2007, 39(4):423–423. 10.1038/ng0407-423
The authors wish to acknowledge the help of William Appelbe for his useful discussions and support. This work was supported by the NHMRC (RC) and the Helen Smibert Vacation Scholarship (TS).
TS carried out the programming and software design and drafted the manuscript. RC conceived of the study, and participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.