Functional analysis of metagenomes and metatranscriptomes using SEED and KEGG
© Mitra et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Metagenomics is the study of microbial organisms using sequencing applied directly to environmental samples. Technological advances in next-generation sequencing methods are fueling a rapid increase in the number and scope of metagenome projects. While metagenomics provides information on the gene content, metatranscriptomics aims at understanding gene expression patterns in microbial communities. The initial computational analysis of a metagenome or metatranscriptome addresses three questions: (1) Who is out there? (2) What are they doing? and (3) How do different datasets compare? There is a need for new computational tools to answer these questions. In 2007, the program MEGAN (MEtaGenome ANalyzer) was released, as a standalone interactive tool for analyzing the taxonomic content of a single metagenome dataset. The program has subsequently been extended to support comparative analyses of multiple datasets.
The focus of this paper is to report on new features of MEGAN that allow the functional analysis of multiple metagenomes (and metatranscriptomes) based on the SEED hierarchy and KEGG pathways. We have compared our results with the MG-RAST service for different datasets.
The MEGAN program now allows the interactive analysis and comparison of the taxonomic and functional content of multiple datasets. As a stand-alone tool, MEGAN provides an alternative to web portals for scientists who have concerns about uploading their unpublished data to a website.
Metagenomics seeks to understand microbial communities by DNA sequencing. Deeper sequencing and better reference databases are advancing the potential and success of such analyses. While metagenomics provides information on the gene content of a microbial community, metatranscriptomics promises to reveal the actual metabolic activities of this community at a specific time and place, and how those activities change in response to environmental forces or biotic interactions.
A number of different systems and resources for metagenome or similar analysis are available in the form of databases, web portals, web services and basic stand-alone programs [1–11]. These resources are mainly focused on the analysis of individual metagenomes and currently do not have the capacity for rapid and highly interactive comparison of multiple datasets. Furthermore, many of these resources are suitable only for taxonomic analysis. In our experience, only the MG-RAST web server [1, 11] currently provides a readily usable service for analyzing a new metagenomic dataset. However, while web portals are attractive because they offer large computational resources for data analysis, some scientists have concerns about uploading their unpublished data to a website.
To address the problem of taxonomic analysis, the program MEGAN was published in 2007 as the first stand-alone interactive tool for analyzing the taxonomic content of a dataset. A subsequent version of the program allows one to compare the taxonomic content of different datasets [13, 14]. MEGAN is easy to install and use, and requires only a BLAST output file as input. The program is designed to allow both high-level analysis that summarizes data at different ranks of the NCBI taxonomy, and detailed analysis that drills down to individual reads and their BLAST matches.
The goal of this paper is to describe new features of MEGAN that allow the functional analysis of a microbial community. This type of analysis can assist in understanding biochemical processes or in estimating the influence of environmental changes on biospheres. The next major release of MEGAN allows the functional analysis of metagenomic and metatranscriptomic datasets using the SEED classification, based on the given BLAST file. For comparative purposes, one can simultaneously map multiple datasets onto the SEED hierarchy and also compute distance matrices on datasets based on their SEED content.
For pathway analysis, the field of systems biology already possesses a high-quality database, namely the Kyoto Encyclopedia of Genes and Genomes (KEGG). MEGAN provides a KEGG analysis window that reports which KEGG pathways are present in a dataset and allows one to then inspect these pathways; for example, MEGAN can capture all reads that are mapped to a given pathway of interest.
MEGAN performs a taxonomic analysis of a dataset by mapping reads onto different taxa in the NCBI taxonomy, depending on the phylogenetic footprint of the gene that a read contains (using the lowest common ancestor algorithm, as described in ). The result is displayed as a rooted tree where the nodes represent the different taxa and are scaled and labeled by the number of reads assigned to the taxon.
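The core idea of a lowest common ancestor (LCA) assignment can be illustrated with a minimal sketch. This is not MEGAN's implementation: the toy parent map, the function names and the absence of any score-based filtering of BLAST matches are all assumptions made for illustration. Given the taxa hit by one read, the read is placed on the deepest node that is an ancestor of all of them.

```python
from typing import Dict, Iterable, List, Optional

# Toy child -> parent map (illustrative only; not real NCBI taxon IDs).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Escherichia coli": "Gammaproteobacteria",
    "Vibrio cholerae": "Gammaproteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus subtilis": "Firmicutes",
}


def path_from_root(taxon: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the list of nodes from the root down to `taxon`."""
    path = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def lowest_common_ancestor(taxa: Iterable[str],
                           parent: Dict[str, Optional[str]]) -> str:
    """Deepest node that lies on the root path of every taxon in `taxa`."""
    taxa = list(taxa)
    if not taxa:
        raise ValueError("at least one taxon is required")
    common = path_from_root(taxa[0], parent)
    for taxon in taxa[1:]:
        other = path_from_root(taxon, parent)
        shared = 0
        # Walk down from the root while both paths agree.
        while (shared < min(len(common), len(other))
               and common[shared] == other[shared]):
            shared += 1
        common = common[:shared]
    return common[-1]
```

With this toy taxonomy, a read whose matches hit both "Escherichia coli" and "Vibrio cholerae" would be placed on "Gammaproteobacteria", whereas a read that also hits "Bacillus subtilis" would move up to "Bacteria".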
Now, as a new feature, MEGAN uses the SEED classification  for functional analysis. In this classification, genes are assigned to functional roles and different functional roles are grouped into subsystems. The SEED classification can be represented by a rooted tree where the internal nodes represent the different subsystems and where the leaves represent the functional roles. Note that the tree is “multi-labeled” in the sense that different leaves may represent the same functional role, if a role occurs in different subsystems. The current SEED tree has about 10,000 nodes.
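The “multi-labeled” property can be made concrete with a small sketch. The nested-dictionary layout, the placeholder subsystem and role names, and the helper functions below are hypothetical and are not taken from the SEED or from MEGAN; they only show how one functional role can appear as a leaf under more than one subsystem.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical miniature hierarchy: inner dicts are subsystems, None marks a
# leaf (functional role). "role 2" occurs under two different subsystems.
Tree = Dict[str, Optional["Tree"]]

SEED_TREE: Tree = {
    "Subsystem A": {"role 1": None, "role 2": None},
    "Subsystem B": {"Subsystem B.1": {"role 2": None, "role 3": None}},
}


def leaf_paths(tree: Tree, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the root-to-leaf path of every functional role in the tree."""
    for name, child in tree.items():
        path = prefix + (name,)
        if child is None:
            yield path
        else:
            yield from leaf_paths(child, path)


def paths_for_role(tree: Tree, role: str) -> List[Tuple[str, ...]]:
    """All positions in the tree at which a given functional role occurs."""
    return [path for path in leaf_paths(tree) if path[-1] == role]
```

Here `paths_for_role(SEED_TREE, "role 2")` returns two paths, one per subsystem, which is why a read mapped to such a role contributes to more than one branch of the tree.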
To perform a KEGG analysis, MEGAN attempts to match each read to a KEGG orthology (KO) accession number, using the best hit to a reference sequence for which a KO accession number is known. MEGAN then calculates the number of hits to each KEGG pathway and reports these numbers to the user. The user can request to see the hits to a given pathway, and an appropriate image of the pathway is generated by coloring the pathway based on the KEGG mapping. MEGAN allows one to analyze several datasets together, using different colors to show which parts of a pathway are present in which datasets. Because genes that are present in different organisms in a consortium of microbes will often not operate together in a single pathway, MEGAN allows one to restrict the pathway analysis to a set of one or more taxa in the NCBI taxonomy.
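The counting step described above can be sketched as follows. The function name, the input dictionaries and the placeholder KO labels are assumptions for illustration, not MEGAN's data structures or real KO accessions. The taxon restriction is simplified to a direct membership test on each read's assigned taxon, whereas a fuller version would presumably also accept descendants of the selected taxa.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set


def count_pathway_hits(read_to_ko: Dict[str, str],
                       ko_to_pathways: Dict[str, Iterable[str]],
                       read_to_taxon: Optional[Dict[str, str]] = None,
                       allowed_taxa: Optional[Set[str]] = None) -> Counter:
    """Count reads per pathway, optionally keeping only reads of given taxa.

    read_to_ko     -- read ID -> KO label of the read's best annotated hit
    ko_to_pathways -- KO label -> pathways that contain this KO
    read_to_taxon  -- read ID -> taxon the read was assigned to
    allowed_taxa   -- if given, only reads assigned to these taxa are counted
    """
    counts: Counter = Counter()
    for read, ko in read_to_ko.items():
        if allowed_taxa is not None:
            taxon = (read_to_taxon or {}).get(read)
            if taxon not in allowed_taxa:
                continue
        # A KO may belong to several pathways; the read counts once for each.
        for pathway in ko_to_pathways.get(ko, ()):
            counts[pathway] += 1
    return counts


# Placeholder KO labels; only the pathway names come from the text.
example_reads = {"read1": "KO-1", "read2": "KO-1", "read3": "KO-2", "read4": "KO-3"}
example_kos = {
    "KO-1": ["citrate cycle (TCA cycle)"],
    "KO-2": ["photosynthesis"],
    "KO-3": ["citrate cycle (TCA cycle)", "photosynthesis"],
}
example_taxa = {"read1": "Taxon X", "read2": "Taxon Y", "read3": "Taxon X", "read4": "Taxon Y"}
```

Calling the function once per dataset yields one count table per dataset, which is the kind of information a multi-colored pathway view needs.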
We have performed pathway analyses for pooled DNA samples from the PLM-Bergen datasets. To compare MEGAN against MG-RAST, we have concentrated on two different pathways, namely the ‘citrate cycle (TCA cycle)’ and ‘photosynthesis’.
These results show the advantages of MEGAN’s KEGGviewer for KEGG-based functional analysis.
Here we have presented the functional assignment module of MEGAN. Using two examples, we show that MEGAN and SEED obtain results comparable to those of MG-RAST. With this new version of MEGAN, researchers can perform a functional analysis using the SEED classification. Because MEGAN performs this analysis directly from the BLAST input file, no additional calculations are required. Thus, MEGAN provides a stand-alone alternative to the MG-RAST server. In future work, we plan to integrate MEGAN into the MG-RAST portal. This will bring together the highly interactive features of MEGAN and the computational power of MG-RAST.
For the first case study, we used a small subset consisting of 1408 sequences from an FLX-titanium pyrosequencing dataset (Roche-454) obtained from a hydrothermal vent microbial community. We also used eight marine datasets from Plymouth Marine Laboratory, consisting of four metagenomes (DNA) and four metatranscriptomes (cDNA) from a mesocosm experiment performed in Bergen, Norway (see  for details). In this paper, we refer to these as the PLM-Bergen datasets. All metagenomes and metatranscriptomes were aligned against the NCBI-NR database using the BLASTX tool. The results were then imported into MEGAN (with default parameters), using the ‘Import from BLAST’ option, and saved as MEGAN’s own ‘rma files’.
MEGAN places each read of a given dataset onto one of the taxa (or “nodes”) of the NCBI taxonomy, based on the BLAST matches provided for the read, using the LCA algorithm. For functional assignment in a similar fashion, MEGAN provides a hierarchical representation using the SEED classification. Normally, reads are mapped to the NCBI taxonomy, and the program provides the exact numbers of reads assigned to any given node and the number of hits to any nodes in the subtree rooted at the node. In a similar fashion for the functional analysis, reads are mapped to SEED subsystems using the ‘seed2ncbi.gz’ file from the SEED server.
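The two numbers reported per node, reads assigned exactly to the node and reads assigned anywhere in the subtree rooted at it, can be sketched as follows. The parent map and the function name are hypothetical and the code is not derived from MEGAN; the same logic applies whether the tree is the NCBI taxonomy or the SEED hierarchy.

```python
from collections import defaultdict
from typing import Dict, Optional


def summarized_counts(parent: Dict[str, Optional[str]],
                      assigned: Dict[str, int]) -> Dict[str, int]:
    """For each node, sum the reads assigned to it and to all its descendants.

    parent   -- child -> parent map of a rooted tree (the root maps to None)
    assigned -- node -> number of reads assigned exactly to that node
    """
    total: Dict[str, int] = defaultdict(int)
    for node, n_reads in assigned.items():
        current: Optional[str] = node
        # Every read assigned to a node also counts for each of its ancestors.
        while current is not None:
            total[current] += n_reads
            current = parent[current]
    return dict(total)


# Hypothetical three-level tree: root -> X -> Y.
toy_parent = {"root": None, "X": "root", "Y": "X"}
toy_assigned = {"root": 1, "X": 2, "Y": 3}
```

For the toy input, node "X" has 2 assigned reads but a summarized count of 5, because the 3 reads on its child "Y" are included.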
After opening all the samples in MEGAN, we compared the taxonomic content using the ‘Compare’ menu item to obtain the comparison in a new window. Choosing the ‘SEED’ menu from the comparison window allowed us to get a functional comparison of the samples directly in a new window (Figure 2).
To compare six PLM-Bergen marine samples, based on their functional content, we chose ‘Networks’ from the ‘Option’ menu of MEGAN’s SEEDviewer to see the network comparison view of multiple datasets that are under consideration (Figure 3). Moreover, one has six different choices of distance measures to compute the networks (see  for details).
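A distance matrix on functional profiles can be sketched as follows. The text does not name the six distance measures, so the Bray-Curtis dissimilarity used here is an assumption chosen for illustration; the function names and the input layout (dataset name mapped to per-subsystem read counts) are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def normalize(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert read counts to relative abundances so dataset size cancels out."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("dataset has no assigned reads")
    return {key: value / total for key, value in counts.items()}


def bray_curtis(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Bray-Curtis dissimilarity between two profiles (0 = identical)."""
    keys = set(p) | set(q)
    numerator = sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
    denominator = sum(p.get(k, 0.0) + q.get(k, 0.0) for k in keys)
    return numerator / denominator if denominator else 0.0


def distance_matrix(datasets: Dict[str, Dict[str, int]]
                    ) -> Tuple[List[str], List[List[float]]]:
    """Pairwise distances between datasets, with rows and columns sorted by name."""
    names = sorted(datasets)
    profiles = {name: normalize(datasets[name]) for name in names}
    matrix = [[bray_curtis(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix
```

A matrix of this kind is the usual input to a network or tree construction method, which then draws the datasets so that functionally similar samples lie close together.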
MG-RAST is a leading service for functional annotation. To test the functional assignment of MEGAN, we compared it with MG-RAST’s assignment on two datasets, described in the two case studies below.
We first performed MEGAN-SEED annotations on an unpublished dataset of 1408 sequences (454 FLX-titanium pyrosequencing) of a metatranscriptome obtained from a hydrothermal vent microbial community and compared them with the parallel assignments obtained by MG-RAST. We then performed the same study with a published marine dataset (the Time1-Bag1-DNA sample from the Bergen datasets; 209,073 sequences) and compared the results with MG-RAST’s assignment.
MEGAN includes a module called ‘KEGGviewer’ for the analysis of metagenomic data in the context of pathways. It is designed to take a list of RefSeq accession numbers and map them to KEGG orthologies (KO numbers). We compared the functionality of this module with MG-RAST, which also provides basic pathway analysis methods, by selecting different metabolic pathways for some of the above-mentioned PLM-Bergen marine samples.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.