The MG-RAST server is an open source system based on the SEED framework for comparative genomics [4, 5]. Users can upload raw sequence data in fasta format; the sequences will be normalized and processed and summaries automatically generated. Genome annotation systems are ever evolving; therefore, in order to accommodate new methods that may be developed, the pipeline was designed with a modular framework that allows the rapid addition of new analysis steps or comparative data at any stage of the analysis. The server provides several methods to access the different data types, including phylogenetic and metabolic reconstructions, and the ability to compare the metabolism and annotations of one or more metagenomes and genomes. In addition, the server offers a comprehensive search capability. Access to the data is password protected, and all data generated by the automated pipeline is available for download and analysis in a variety of common formats. Here we describe the key components of the pipeline, which are summarized in Figure 1.
User Registration and Management
The user registration serves two functions: to limit access to each data set to the user and their colleagues and to secure a valid email address in case correspondence is required, for example if a data-processing problem occurs. Once logged in, users can view their own metagenomes, those to which the owner has granted them rights, and the default set of publicly available metagenomes. The system supports delegation of authorization so that users can allow others to access one or more of their metagenomes. In addition, data owners can release their metagenomes to the public at any point, allowing all users of the system to view their data.
Data Types
The pipeline accepts data in a number of formats: 454 reads may be uploaded directly in the format delivered by 454 [6], and fasta files, which are typical of Sanger sequencing and are also used by other platforms, may be uploaded as well. The pipeline will also accept assembled sequences in fasta format. Sequence data may be compressed by one of several common computer programs to speed upload.
Users may choose to upload raw unassembled reads or assembled contigs. As discussed below, each approach has advantages and disadvantages. Users with a limited number of larger contigs, where the average contig length exceeds 40 kb, should consider using the RAST server for the analysis of complete Bacterial and Archaeal genomes [7].
The Genomics Standards Consortium has proposed a minimal set of data, called the Minimum Information about a Genome Sequence (MIGS) [8], that should be collected with every metagenome sequence. Although this is an evolving standard, the metagenomics-RAST server is MIGS-compliant. Metadata, accessory data about the metagenome (e.g., date and location where the sample was collected), is requested from the user at the time of sequence submission. This data is stored with the user's data and can be provided to the GSC genome catalogue, and other archives, when the sequence data is ready for public release.
Implementation and Core Analyses
The pipeline is implemented in Perl using a number of open source components, including the SEED framework [4], NCBI BLAST [9], SQLite, and Sun Grid Engine [10]. The system also uses the publicly available SEED subsystems, SEED nr, and FIGfam protein families (see http://www.theSEED.org).
The distinct steps are implemented to provide a flexible, extensible processing pipeline. The steps incrementally add data to a self-contained "job directory" that contains all job-relevant data in flat file and SQLite [11] format. Relational database technology is used to efficiently provide a mapping of sequences in a metagenome to both organisms and metabolic functions and at the same time allow the user to change the parameters for the underlying sequence matches. The user interface enables the download of the user's job directories, and a future version of the software will allow uploading of user-created directories into the server.
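The role of the relational store can be illustrated with a minimal Python sketch (the pipeline itself is written in Perl). The table layout, the column names, and the functions `build_job_db` and `organism_counts` are assumptions made for illustration only; they are not the schema used by the server. The sketch shows how keeping the raw similarity matches in SQLite lets the organism mapping be recomputed whenever the match parameters change.

```python
import sqlite3

def build_job_db(rows):
    """Create an in-memory database holding similarity matches.

    Hypothetical schema: one row per match, with the read, the matched
    organism and function, the expect value, and the alignment length.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits ("
        "read_id TEXT, organism TEXT, function TEXT, "
        "evalue REAL, align_length INTEGER)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn

def organism_counts(conn, max_evalue, min_length):
    """Count distinct reads per organism under the given match criteria."""
    cursor = conn.execute(
        "SELECT organism, COUNT(DISTINCT read_id) FROM hits "
        "WHERE evalue <= ? AND align_length >= ? "
        "GROUP BY organism ORDER BY organism",
        (max_evalue, min_length),
    )
    return dict(cursor.fetchall())

# Illustrative rows only; these are not real results.
example_rows = [
    ("r1", "Escherichia coli", "DNA polymerase", 1e-20, 90),
    ("r2", "Escherichia coli", "ATP synthase", 5e-3, 35),
    ("r3", "Bacillus subtilis", "ATP synthase", 1e-8, 60),
]
conn = build_job_db(example_rows)
permissive = organism_counts(conn, max_evalue=0.01, min_length=0)
stringent = organism_counts(conn, max_evalue=1e-5, min_length=50)
```

Because the stored matches are filtered at query time, tightening the criteria changes the derived counts without repeating the similarity searches.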
After uploading the data, a normalization step (see Figure 2) is executed, generating unique internal IDs and removing exact duplicate sequences from 454 data sets. (These sequences are an artefact of the sequencing technique and are not scientifically meaningful [12].)
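A minimal Python sketch of such a normalization step follows. It is not the server's code: the function name `normalize_reads`, the ID format, and the choice to compare sequences case-insensitively are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple

def normalize_reads(
    records: Iterable[Tuple[str, str]], prefix: str = "read"
) -> Iterator[Tuple[str, str, str]]:
    """Yield (internal_id, original_id, sequence) for each distinct sequence.

    Only the first occurrence of an exact duplicate is kept. Comparing
    case-insensitively is an assumption of this sketch.
    """
    seen = set()
    counter = 0
    for original_id, sequence in records:
        key = sequence.upper()
        if key in seen:
            continue  # exact duplicate of an earlier read: drop it
        seen.add(key)
        counter += 1
        # Hypothetical internal ID format: prefix plus zero-padded counter.
        yield f"{prefix}_{counter:08d}", original_id, sequence

reads = [("a", "ACGT"), ("b", "acgt"), ("c", "GGCC")]
normalized = list(normalize_reads(reads))
```

Here read `b` is dropped as a duplicate of read `a`, and the two remaining reads receive sequential internal IDs.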
In the second step, the sequences are screened for potential protein encoding genes (PEGs) via a BLASTX [9] search against the SEED comprehensive nonredundant database sourced from the INSDC databases, sequencing centers, and other sources [4]. An expect value (E) cut-off of 0.01 is used to pick up potentially coding elements. (This was chosen empirically to increase the number of potentially coding elements while not being overwhelming for data analysis.) In parallel with the BLASTX searches, the sequence data is compared to all accessory databases by using the appropriate algorithms and significance selection criteria. These databases include several rDNA databases, including GREENGENES [13], RDP-II [14], and the European 16S RNA database [15], and boutique databases such as the chloroplast database, mitochondrial database, and ACLAME database of mobile elements [16]. The search criteria are specific for each database. For example, screens for ribosomal RNA genes are performed by using BLASTN against the rDNA databases, but much more stringent selection criteria are used to identify candidate RNA genes than for identifying protein-encoding genes (by default, the similarity must exceed 50 bp in length and have an expect value less than 1 × 10⁻⁵).
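The two default thresholds described above can be expressed as simple filters. The following Python sketch assumes the matches have already been parsed into records; the `Hit` type and the function names are hypothetical. Whether the protein cut-off of 0.01 is inclusive is not stated in the text, so treating it as inclusive is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    """One similarity match (hypothetical record layout)."""
    query_id: str
    subject_id: str
    evalue: float
    align_length: int  # aligned length, in bp for BLASTN matches

PROTEIN_MAX_EVALUE = 0.01  # default cut-off for BLASTX matches
RRNA_MAX_EVALUE = 1e-5     # default cut-off for rDNA matches
RRNA_MIN_LENGTH = 50       # an rDNA match must exceed this length (bp)

def keep_protein_hit(hit: Hit) -> bool:
    """Keep a BLASTX match at or below the cut-off (inclusive by assumption)."""
    return hit.evalue <= PROTEIN_MAX_EVALUE

def keep_rrna_hit(hit: Hit) -> bool:
    """Keep a BLASTN match that exceeds 50 bp and has E below 1e-5."""
    return hit.align_length > RRNA_MIN_LENGTH and hit.evalue < RRNA_MAX_EVALUE

weak_short = Hit("q1", "s1", evalue=0.005, align_length=40)
strong_long = Hit("q2", "s2", evalue=1e-10, align_length=120)
```

A match such as `weak_short` passes the protein screen but fails the rRNA screen, which reflects the more stringent criteria applied to candidate RNA genes.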
In the third step, these matches to external databases are used to compute the derived data. First, a phylogenomic reconstruction of the sample is computed by using both the phylogenetic information contained in the SEED nr database and the similarities to the ribosomal RNA database. Functional classifications of the PEGs are computed by projecting against SEED FIGfams [17] and subsystems based on these similarity searches [4]. These functional assignments become the raw input to an automatically generated initial metabolic reconstruction of the sample, providing suggestions for metabolic fluxes and flows, reactions, and enzymes.
One of the design goals of this server was easy accessibility via a web-based interface. The interface provides views for browsing and analysis of the data, as well as a means to download all result tables and the sequences for every subset displayed. Figure 3 provides an overview of the various elements of the user interface and highlights the options for downloading various subsets. The user interface provides a means to alter some of the parameters used to compute the functional, metabolic, and phylogenetic reconstruction. This allows more stringent match criteria (e.g., expectation value, score, overall percent identity, length of match, and number of mismatches); and, by restricting the matches, the derived data is dynamically changed. The default parameters have been chosen by empirical testing and represent a tradeoff between accuracy and specificity.
Comparative Metagenomics
The abundance of comparative metagenomics tools is central to the utility of the MG-RAST platform. Various tools have been built into the framework, allowing users to compare their data against other metagenomes or complete genomes taken from the SEED [4] environment. The subsystems heat map and the taxonomic heat map provide comparative metagenomics summaries that encapsulate the differences between samples.
The subsystem comparison tools identify the number of PEGs in each metagenome that are connected to a subsystem via protein-level similarity. Based on these connections, each subsystem present in a sample is scored by counting the number of sequences that are similar to a protein in that subsystem. This score is divided by the total number of sequences from the sample that are similar to any protein in a subsystem, giving the fraction of subsystem-associated sequences that fall in the given subsystem. This approach allows comparisons between samples that have different numbers of sequences. Since the fractions tend to be small (a few sequences hit each subsystem, but there are now over 600 subsystems in the SEED), the scores can be factored for display purposes. Furthermore, a nonquantitative approach is provided to group the subsystem scores, emphasizing those subsystems that are most different between the samples. Moreover, the display can be limited to specific areas of metabolism, or other subsystem groups, as desired by the user.
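The scoring described above can be sketched in Python as follows. The function name `subsystem_fractions`, the input layout (a mapping from each sequence to the subsystems it matches), and the `scale` parameter standing in for the display factor are assumptions of this sketch.

```python
from collections import Counter
from typing import Dict, Iterable

def subsystem_fractions(
    read_to_subsystems: Dict[str, Iterable[str]], scale: float = 1.0
) -> Dict[str, float]:
    """Return, per subsystem, the fraction of subsystem-associated reads in it.

    Reads with no subsystem match are excluded from the denominator.
    `scale` is an optional multiplier for display (hypothetical parameter).
    """
    counts = Counter()
    reads_in_any = 0
    for subsystems in read_to_subsystems.values():
        distinct = set(subsystems)
        if not distinct:
            continue  # read is not similar to any subsystem protein
        reads_in_any += 1
        for name in distinct:
            counts[name] += 1  # count each read once per subsystem
    if reads_in_any == 0:
        return {}
    return {name: scale * n / reads_in_any for name, n in counts.items()}

# Illustrative sample only; these are not real results.
sample = {
    "r1": ["Glycolysis"],
    "r2": ["Glycolysis", "TCA cycle"],
    "r3": [],
    "r4": ["Photosynthesis"],
}
fractions = subsystem_fractions(sample)
```

In this illustrative sample, three reads match at least one subsystem, so each subsystem count is divided by three regardless of how many reads the sample contains in total.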
The taxonomic heat map works in an analogous fashion but highlights the different taxonomic profiles in each sample, as determined by the phylogenetic or phylogenomic approaches selected by the end user (e.g., 16S comparisons, phylogenomics from BLAST results). Again, samples may be grouped in a nonquantitative fashion to rapidly highlight particular phylogenetic groups that predominate in different samples.
Often a metagenome comprises a few dominant organisms, and many of the pathways in the metagenome can be predicted. The automatically generated metabolic reconstructions can be compared to any given metagenome or complete microbial genome. This approach highlights subsystems that are unique to a metagenome, a comparative genome, or the subsystems common to both. With these tools, users can identify shared metabolism present in their samples.
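At its simplest, this comparison reduces to set operations on the subsystems detected in each reconstruction. The Python sketch below assumes each reconstruction has been reduced to a collection of subsystem names; the function name `compare_subsystems` and the result keys are hypothetical.

```python
from typing import Dict, Iterable, Set

def compare_subsystems(
    first: Iterable[str], second: Iterable[str]
) -> Dict[str, Set[str]]:
    """Partition subsystems into those unique to each input and those shared."""
    a, b = set(first), set(second)
    return {
        "only_first": a - b,   # present only in the first reconstruction
        "only_second": b - a,  # present only in the second reconstruction
        "shared": a & b,       # common to both
    }

# Illustrative subsystem lists only; these are not real results.
metagenome = ["Glycolysis", "TCA cycle", "Photosynthesis"]
genome = ["Glycolysis", "TCA cycle", "Flagellum"]
comparison = compare_subsystems(metagenome, genome)
```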