EGenBio: A Data Management System for Evolutionary Genomics and Biodiversity
© Nahum et al. 2006
Published: 26 September 2006
Evolutionary genomics requires management and filtering of large numbers of diverse genomic sequences for accurate analysis and inference on evolutionary processes of genomic and functional change. We developed Evolutionary Genomics and Biodiversity (EGenBio; http://egenbio.lsu.edu) to begin to address this.
EGenBio is a system for manipulation and filtering of large numbers of sequences, integrating curated sequence alignments and phylogenetic trees, managing evolutionary analyses, and visualizing their output. EGenBio is organized into three conceptual divisions, Evolution, Genomics, and Biodiversity. The Genomics division includes tools for selecting pre-aligned sequences from different genes and species, and for modifying and filtering these alignments for further analysis. Species searches are handled through queries that can be modified based on a tree-based navigation system and saved. The Biodiversity division contains tools for analyzing individual sequences or sequence alignments, whereas the Evolution division contains tools involving phylogenetic trees. Alignments are annotated with analytical results and modification history using our PRAED format. A miscellaneous Tools section and Help framework are also available. EGenBio was developed around our comparative genomic research and a prototype database of mtDNA genomes. It uses MySQL relational databases and dynamic page generation, and calls numerous custom programs.
EGenBio was designed to serve as a platform for tools and resources to ease combined analysis in evolution, genomics, and biodiversity.
Large-scale genomic technologies have generated an extraordinary amount of data in the past few decades. Consequently, a huge effort has been made toward creating biological databases and systems to organize, analyze, and share information with the world-wide community [1–4]. The application of genomic technologies to molecular evolution has opened new frontiers in the interdisciplinary field of evolutionary genomics, and this has given rise to a great potential to elucidate complex questions in biology [2, 5]. Understanding of evolutionary processes is critical, since they determine the sequence, structure, and function of macromolecules, and ultimately shape the higher-level biological complexity of organisms.
Genomic biodiversity has been defined as dense sampling of molecular data from diverse taxonomic groups for large genomic regions or complete genomes, and inferences concerning evolutionary processes are greatly improved by adopting combined molecular and computational approaches that include a large amount of genomic biodiversity [6–9]. We have found that the study of evolutionary genomics in the context of a dense sampling of species gives rise to many unique data processing problems, and so have developed the Evolutionary Genomics and Biodiversity (EGenBio) project as a web-based system to simplify large-scale evolutionary data management.
The central aim of EGenBio is to provide integrated analysis and visualization of raw sequence data, alignments, and phylogenetic trees, to rapidly curate and annotate that data, and to filter that data based on these annotations for further analysis of specific genomic contexts. It is designed to be robust to change and easily extensible to other datasets and other analytical programs. To accomplish this, EGenBio has web-based interfaces designed to: (1) access computational tools for phylogenetic and evolutionary analyses; (2) facilitate the construction of large-scale sequence and alignment datasets across diverse taxa; and (3) provide a framework for comparative analysis of diverse genes and genomes. EGenBio may also serve to promote the utility of increases in the scale of genomic biodiversity.
Custom tools* currently in the main divisions of EGenBio
Search the mtDNA genome database for DNA or protein sequences
Search the mtDNA genome database for sequence alignments
Search the human mtDNA database for DNA or protein sequences
Search the human mtDNA database for sequence alignments
Translate labels of a tree file
Visualize filters associated with alignments
Extract tree clusters along with information on branch lengths
Visualize results from saturation mutagenesis MCMC analysis
Detect coevolution among residues using LRTs and trees
List species currently in EGenBio
Search species by taxonomic group or NCBI genome identifier
Display mitochondrial gene order for specified taxa
Provide information about the EGenBio databases
Generate permutations for use in primer design
Produce degenerate primers that reflect amino acid variation
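The last tool in the list above, degenerate primer production, can be illustrated with a minimal sketch. It is not the EGenBio implementation, which the article does not describe. It assumes the input is the set of codons observed at each aligned codon position, and it collapses each nucleotide position to the standard IUPAC ambiguity code. The function names `degenerate_codon`, `degenerate_primer`, and `degeneracy` are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}
# Reverse lookup: ambiguity code -> number of bases it stands for.
CODE_SIZE = {code: len(bases) for bases, code in IUPAC.items()}


def degenerate_codon(codons):
    """Collapse the codons observed at one site into a single degenerate codon."""
    codons = [c.upper() for c in codons]
    if not codons or any(len(c) != 3 for c in codons):
        raise ValueError("expected a non-empty list of three-letter codons")
    # For each of the three positions, look up the code for the set of observed bases.
    return "".join(IUPAC[frozenset(c[i] for c in codons)] for i in range(3))


def degenerate_primer(codon_columns):
    """Build a primer from a list of sites, each a list of observed codons."""
    return "".join(degenerate_codon(column) for column in codon_columns)


def degeneracy(primer):
    """Number of distinct sequences represented by a degenerate primer."""
    total = 1
    for code in primer:
        total *= CODE_SIZE[code]
    return total


# Hypothetical example: three codon sites with the variation seen across species.
sites = [["ATG"], ["CTA", "CTG", "TTA"], ["GGA", "GGC"]]
primer = degenerate_primer(sites)
print(primer, degeneracy(primer))
```

Working from observed codons rather than back-translating amino acids avoids committing to a genetic code table, which matters for mitochondrial genes because the vertebrate mitochondrial code differs from the standard code.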
In addition to the three main sections, EGenBio contains a Tools section that serves as a repository for small stand-alone tools that may also exist as components of other pages, or which serve other simple purposes. The Help link leads not to a separate section, but rather to what we will call a separate "framework". The structure of the Help framework mirrors the main framework exactly, but instead of linking to actual tools, Help pages link to detailed descriptions and documentation for each page. Invisible to most users, a hidden Design framework allows for rapid editing and movement of page and site structure information from design to laboratory testing stages, and finally to public access.
EGenBio is a web-based system for analysis in evolutionary genomics and biodiversity. It provides tools and resources for quickly creating, modifying, and analyzing large alignment datasets in ways that we have found useful in our own computer-based and experimental evolutionary genomics research. Our prototype database of complete vertebrate mitochondrial genomes represents the densest complete set of genes currently available from closely related organisms. It can be accessed flexibly according to comma-separated queries or a phylogenetic tree navigation system. EGenBio is designed to be easily extensible to use with other protein complexes and other analytical programs. Our goal is to incorporate and utilize as many existing programs as possible, and to develop only "added value" programs. In the current public version, all tools are novel to our system except that alignments are created using ClustalW. The PRAED alignment annotation system based on data filters allows alignments to be modified easily according to user interest in annotation features, and allows for the results of analyses to be returned as further annotations on the alignments. Since it is derived from the NEXUS format, it is easy to add batch commands to direct analyses using many common phylogenetic analysis programs. The PRAED format and data filters are a unique feature of the EGenBio system.
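The idea of annotation-driven data filters can be sketched roughly as follows. This is an illustration only: the actual PRAED syntax and filter definitions are not given in this text, so the `AnnotatedAlignment` class, the `gap_fraction` annotation, and the `apply_filter` function are all hypothetical stand-ins. The sketch shows an alignment that carries per-column annotation tracks and a modification history, and a filter that keeps only the columns whose annotation passes a user-supplied test.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedAlignment:
    """Aligned sequences plus per-column annotation tracks (hypothetical layout)."""
    sequences: dict                                   # taxon name -> aligned sequence
    annotations: dict = field(default_factory=dict)   # track name -> one value per column
    history: list = field(default_factory=list)       # record of modifications applied

    def length(self):
        return len(next(iter(self.sequences.values())))


def gap_fraction(aln):
    """Example analysis: fraction of gap characters in each alignment column."""
    n = len(aln.sequences)
    return [
        sum(seq[i] == "-" for seq in aln.sequences.values()) / n
        for i in range(aln.length())
    ]


def apply_filter(aln, track, keep):
    """Return a new alignment with only the columns where keep(value) is true."""
    kept = [i for i, value in enumerate(aln.annotations[track]) if keep(value)]
    return AnnotatedAlignment(
        sequences={name: "".join(seq[i] for i in kept) for name, seq in aln.sequences.items()},
        # Every annotation track is trimmed to the same surviving columns.
        annotations={name: [vals[i] for i in kept] for name, vals in aln.annotations.items()},
        history=aln.history + [f"filter on {track}: kept {len(kept)} of {aln.length()} columns"],
    )


# Hypothetical toy alignment; analysis results are stored back as an annotation track.
aln = AnnotatedAlignment({"human": "ATG-CA", "mouse": "ATGACA", "chicken": "AT--CA"})
aln.annotations["gap_fraction"] = gap_fraction(aln)
filtered = apply_filter(aln, "gap_fraction", lambda v: v <= 0.5)
print(filtered.sequences)
print(filtered.history)
```

Returning a new object rather than editing in place keeps the unfiltered alignment available, and the appended history entry mirrors the modification history that the annotated format is described as recording.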
Future modules under development in EGenBio include the creation of additional data filters, incorporation of more genes for analysis of functional divergence, development of further visualization tools for statistical analyses of evolutionary dynamics, and automated procedures for analysis using existing programs and tools. We also welcome feedback from the scientific community on areas of general need for integrated evolutionary genomics tools. EGenBio is publicly available and can be accessed at http://egenbio.lsu.edu/ via anonymous login. User accounts that allow users to save search parameters and results are provided upon request. Incorporation and private access to pre-publication data can also be accommodated upon request. Replication of the EGenBio system would require a Linux-based operating system capable of running Perl, Perl-GD, R, PHP, MySQL, and an Apache web server. It would also require installation of numerous custom scripts in addition to ClustalW.
LRT: Likelihood ratio test
MCMC: Markov chain Monte Carlo
NCBI: National Center for Biotechnology Information
This work was partly funded by the National Institutes of Health (R22/R33 Innovation and Development grant to David Pollock), the National Science Foundation (CBM2/EPSCOR), and the State of Louisiana (Biological Computation and Visualization Center, Governor's Biotechnology Initiative, and startup funds to David Pollock). We also anonymously thank other current and former members in the Pollock laboratory for assisting in the development and testing of various tools, and thank Chad Jarreau, Jonathan Bonin, Jonny Roberts Jr., Patricia Ledwig, Stephen McCullough, Sujatha Muralidharan, and Yonatan Platt for contributing to the Biodiversity image collection.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.