An application of statistics to comparative metagenomics
© Rodriguez-Brito et al; licensee BioMed Central Ltd. 2006
Received: 24 August 2005
Accepted: 20 March 2006
Published: 20 March 2006
Metagenomics, the sequence analysis of genomic DNA isolated directly from the environment, can be used to identify organisms and model the community dynamics of a particular ecosystem. Metagenomics also has the potential to identify significant differences in metabolic potential between environments.
Here we use a statistical method to compare curated subsystems and to predict physiology, metabolism, and ecology from metagenomes. This approach can be used to identify those subsystems that are significantly different between metagenome sequences. Subsystems that were overrepresented in the Sargasso Sea and Acid Mine Drainage metagenomes when compared to non-redundant databases were identified.
The methodology described herein applies statistics to the comparisons of metabolic potential in metagenomes. This analysis reveals those subsystems that are more, or less, represented in the different environments that are compared. These differences in metabolic potential lead to several testable hypotheses about physiology and metabolism of microbes from these ecosystems.
Metagenomics describes the functional and sequence-based analysis of DNA isolated from environmental samples without first culturing the associated microbes. Four viral and four prokaryotic shotgun metagenome datasets have been published [2–8]. The acid mine drainage (AMD) metagenome data set was taken from a low-complexity environment and includes slightly more than 10 Mb of sequence in 2,455 contiguous sequences (contigs) and ~8,000 predicted protein sequences. The Sargasso Sea metagenome data set is from a more complex environment and includes 788 Mb of sequence in 809,112 contigs, and approximately a million predicted protein sequences.
Number of genomes and protein-encoding genes in the SEED database at the time of analysis. The two environmental samples are the Sargasso Sea and Acid Mine Drainage metagenomes. Columns: number of genomes; percent of all proteins.
FIG pioneered the use of subsystems to annotate both complete and partial genome sequences. Subsystems are biochemical pathways, fragments of pathways, clusters of genes that function together, or any group of genes that an annotator considers to be related. The subsystems are annotated across genomes by the annotators, providing the most reliable and consistent annotations within and between genomes. The subsystems-based annotations are ongoing, and at any given point in time the subsystems represent a snapshot of the best available annotation of the SEED database.
Comparing metagenome samples could lead to the identification of signature functions associated with each metagenome sample. However, this analysis requires statistical techniques that are not only robust but also rapid to perform with hundreds of thousands or millions of data points per sample.
Here the Sargasso Sea and AMD metagenomes were compared with the SEED database to identify statistically significant differences in subsystem composition. We hypothesized that there were few barriers to the transfer of subsystems between environments and that certain subsystems were therefore enriched by selection in those environments where they were important. We used a difference of medians analysis to identify those subsystems that have a statistically significant presence in each of the metagenomes. These analyses provide a framework for the statistical comparison of metagenomes.
Determination of statistically significantly different subsystems
A difference between medians calculation was applied to rapidly identify statistically significant differences between metagenomes. This technique has several advantages over other possible statistical methods that could have been applied. For example, the difference between medians is extremely rapid for the calculation of differences between subsystems from different samples, and the method does not depend on the distribution of samples. The source code and step-by-step description of the method are provided as part of the supplemental material [see Additional Files 1 and 2].
Number of samples needed to identify significant differences between metagenomes
Phylosubsystems that are overrepresented in AMD dataset versus SEED dataset with 99% confidence at a sample size of 145,000 proteins.
Amino Acids and Derivatives
Amino Acids and Derivatives
Amino Acids and Derivatives
Amino Acids and Derivatives
Embden-Meyerhof and Gluconeogenesis
Cofactors, Vitamins, Prosthetic Groups, Pigments
Fatty acid metabolism
Fatty Acids and Lipids
Fatty acid oxidation pathway
Fatty Acids and Lipids
de-novo Purine Biosynthesis
Nucleosides and Nucleotides
de-novo Pyrimidine Biosynthesis
Nucleosides and Nucleotides
Nucleosides and Nucleotides
Ribosome LSU (eukaryotic and archaeal)
Ribosome SSU (eukaryotic and archaeal)
Translation initiation factors (eukaryotic and archaeal)
Subsystem differences between the SEED and Sargasso Sea metagenomes
Fig. 2B highlights some of the differences for phylosubsystems. A more detailed description of all of the subsystems with statistically significant differences in occurrence between environmental data sets is given in the supplemental material [see Additional file 3]. Some ecologically important differences between the Sargasso Sea and the SEED database are discussed below with data extracted from the Supplementary Table.
Potential osmoregulation by amino acids in the Sargasso Sea
Presence of Glycine, Serine, and Threonine subsystems in the AMD, SEED, and Sargasso databases. The table is a subset of the data from the supplemental data [see Additional File 3]. Columns: AMD per million; SEED per million; Sargasso Sea (SS) per million.
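Because the data sets differ greatly in size (7,588 predicted proteins in the AMD data set versus 960,561 in the Sargasso Sea data set), raw hit counts are not directly comparable. The following is a minimal sketch of a per-million normalization, assuming that "per million" means hits per million predicted proteins in a data set. The function name `per_million` and the hit counts in the example are hypothetical and are not taken from the supplemental data.

```python
def per_million(hits: int, total_proteins: int) -> float:
    """Scale a raw hit count to hits per million proteins in the data set."""
    if total_proteins <= 0:
        raise ValueError("total_proteins must be positive")
    return hits * 1_000_000 / total_proteins


# Data set sizes as stated in the Methods; hit counts are made-up illustrations.
DATASET_SIZES = {"AMD": 7_588, "Sargasso": 960_561}
example_hits = {"AMD": 10, "Sargasso": 1_000}  # hypothetical counts

normalized = {
    name: per_million(example_hits[name], DATASET_SIZES[name])
    for name in DATASET_SIZES
}
```

With these illustrative numbers the smaller raw count in the AMD data set corresponds to the larger normalized value, which is why comparisons are made on the normalized scale.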
Photosynthesis in the Sargasso Sea
As previously observed [6, 7], there was a strong bias towards subsystems involved in photosynthesis in the Sargasso Sea metagenome. This bias includes subsystems for the Calvin-Benson cycle, chlorophyll biosynthesis genes, the cytochrome B6-F complex, Photosystem I, Photosystem II, isoprenoid biosynthesis, and carotenoid biosynthesis.
Some phylosubsystems involved in one-carbon metabolism, including the synthesis and degradation of carbohydrates, cell walls, and capsules, are more abundant in the Sargasso Sea. In contrast, the genes for the utilization of complex carbon sources including lactose, arabinose, fructose, mannose, galactitol, and inositol are all underrepresented in the marine environment, suggesting that these are not significant sources of carbon in this environment.
Nucleic acid and phosphate metabolism in the Sargasso Sea
Phylosubsystems involved in purine and pyrimidine de novo synthesis and scavenging pathways, as well as ribonucleotide reduction (scavenging ribonucleotides for DNA synthesis), are more abundant in the Sargasso Sea. Similarly, the subsystems involved in the capture of phosphate via the conversion of ADP to ATP coupled to oxidative phosphorylation are also overrepresented in the Sargasso sample. In contrast, nitrogen metabolism phylosubsystems are less abundant in the Sargasso than the SEED, with the sole exception of ammonia assimilation, which is marginally overrepresented in the Sargasso sample at larger sample sizes. The Sargasso Sea has previously been reported to be phosphate limited. The concentration of dissolved inorganic phosphate is approximately 4 nM in the Sargasso Sea. By comparison, the North Pacific and typical bacterial minimal media have phosphate concentrations of approximately 100 nM [18, 19]. Together, these results suggest that phosphate acquisition is critical for microbial growth in the Sargasso Sea environment.
Mobility of Sargasso Sea microbes
Estimates of the percentage of bacteria in the ocean that are motile vary from less than 5% to more than 80% [20, 21], and there were far fewer genes encoding flagella in the marine environment compared to the SEED database. However, many marine microbes are thought to use alternative, less well characterized motility systems, such as the motility mechanism characterized in cyanobacteria [22, 23] or the twitching motility previously shown in marine microbes. These data lead to the hypothesis that marine microbes are generally not using flagella-based motility for movement, and future studies on the genomics of twitching and gliding motility may reveal hints about these mechanisms of movement.
Subsystem differences between the SEED and AMD metagenomes
When the AMD and SEED databases were compared, only phylosubsystems that were present in both the AMD and the SEED samples were included. This limited the total number of subsystems that were compared for statistically significant differences. There are far fewer phylosubsystems with significantly different distributions between the AMD and SEED datasets; phylosubsystems that are significantly more common in the AMD dataset are shown in Table 2. The different occurrences of subsystems reflect the limited complexity of the AMD environment, which contains Bacteria and Archaea. The majority of subsystems that are significantly more common in the AMD data set are from archaeal proteins. In the AMD environment, the production of amino acids does not appear to be critical, and only archaeal arginine and histidine degradation and leucine and chorismate synthesis are overrepresented in these samples. Our limited selection of overrepresented subsystems in the AMD sample presumably reflects the current bias in annotated subsystems in the SEED. As the subsystems continue to evolve and expand, and as the NIH Project to Annotate 1,000 Genomes matures, the impact of these annotations on the AMD sample and other metagenomes will highlight those areas of metabolism and physiology that are critical to survival in different environments.
Subsystem differences between the Sargasso, SEED, farm and whale metagenomes
The SEED and Sargasso subsystems were compared to both the whale fall and farm metagenome samples. For this comparison, the individual whale fall samples and the individual farm samples were each combined to create two separate metagenomes. Those metagenomes were compared to the subsystems exactly as described in Methods, using the BLAST algorithm to determine similar sequences. The data shown in the supplemental material [see Additional file 4] were created using 95% confidence, a sample size of 20,000 proteins, and 20,000 replicates. This table shows each of the comparisons with the statistically significant subsystems. The normalized data were used to determine the relative abundance of each KEGG pathway in each sample, and these comparisons are shown in the supplemental material [see Additional File 5].
The KEGG pathways have historically focused on core metabolism, annotating enzymes that have been classified with EC numbers. In contrast, the SEED subsystems include core metabolism and extend to subsystems that cover cellular processes and functions, regulation, and so forth. The two classification techniques are not directly comparable, and statistical confidence was not provided with the differences between KEGG pathways in the supplemental data from the previous analysis. Nevertheless, some clear parallels can be seen between these analyses [see Additional File 5]. For example, both techniques identified that riboflavin metabolism is more prevalent in the Whale Fall metagenomes than in the other samples. The two analyses disagree on folate biosynthesis, however: according to the normalized data from Tringe et al., folate biosynthesis is less abundant in the Sargasso metagenome than in either the Whale Fall or Soil metagenomes, whereas this analysis demonstrated that there is significantly more folate biosynthesis in the Sargasso than in the other samples. There were 9,311 proteins with similarity to the folate biosynthesis subsystem from the SEED database in the Sargasso metagenome, 602 proteins with similarity in the farm soil metagenome, and 491 proteins with similarity in the Whale Fall metagenome. In contrast, Tringe et al. identified 7,283 proteins, 1,253 proteins, and 889 proteins, respectively. These differences are probably due to the difference in annotation between the SEED subsystems and the KEGG pathways. They also highlight the need for continued careful annotation of genomes, and for comparative analysis of different annotation systems and methods.
Community genome sequencing (metagenomics) can provide finely detailed analysis of the metabolism occurring in different ecosystems. However, metagenomic analysis is limited to a purely descriptive science without a rigorous statistical comparison of the prevalence of different genes in different environments. Our analysis demonstrates an application of statistics to identify those areas of metabolism that are significantly overrepresented in different environments.
The method described here is predicated on the expectation that genes that are more useful in an environment are more commonly found in that environment. Or put another way, there is an enrichment or selection for sets of genes in different environments. A statistical analysis, using a resampling with replacement technique, was developed to generate both the difference in occurrence of each subsystem in each sample, and to generate confidence intervals for the likelihood that these differences are observed by chance. By using these statistical techniques to compare the genetic composition of different environments, the areas of metabolism and biochemistry that are important in a particular environment, in comparison to other environments, can be identified. Like other studies, this analysis demonstrated that microbes in the surface waters of the ocean are much more likely to contain genes involved in photosynthesis than the control data set. The non-redundant database used as a control is not expected to contain large numbers of photosynthetic organisms because it is skewed towards microbial pathogens.
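The resampling idea can be illustrated with a short sketch. This is not the software provided in Additional File 1; it is a minimal illustration that assumes each data set is represented as a list with one subsystem label per protein, and it builds the null distribution by drawing both samples from the pooled data. The function name `compare_subsystems`, its parameters, and the return layout are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def compare_subsystems(labels_a, labels_b, sample_size=1000, replicates=1000,
                       confidence=0.95, seed=0):
    """Bootstrap a difference of medians for every subsystem label.

    Returns {subsystem: (difference, low, high, significant)} where
    (low, high) is the central interval of differences expected by chance.
    """
    rng = random.Random(seed)
    labels_a, labels_b = list(labels_a), list(labels_b)
    pooled = labels_a + labels_b
    subsystems = sorted(set(pooled))

    def draw(labels):
        # Sample proteins with replacement and count each subsystem.
        return Counter(rng.choices(labels, k=sample_size))

    counts_a = {s: [] for s in subsystems}
    counts_b = {s: [] for s in subsystems}
    null_diffs = {s: [] for s in subsystems}
    for _ in range(replicates):
        a, b = draw(labels_a), draw(labels_b)
        null_a, null_b = draw(pooled), draw(pooled)
        for s in subsystems:
            counts_a[s].append(a[s])
            counts_b[s].append(b[s])
            null_diffs[s].append(null_a[s] - null_b[s])

    tail = (1.0 - confidence) / 2.0
    results = {}
    for s in subsystems:
        difference = median(counts_a[s]) - median(counts_b[s])
        ordered = sorted(null_diffs[s])
        low = ordered[int(tail * (replicates - 1))]
        high = ordered[int((1.0 - tail) * (replicates - 1))]
        results[s] = (difference, low, high, difference < low or difference > high)
    return results
```

Called with two label lists, for example `compare_subsystems(sargasso_labels, seed_labels, sample_size=20000, replicates=20000)`, the sketch flags a subsystem when its difference of medians falls outside the interval obtained from the pooled data. The variable names in that call are placeholders.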
Our analysis also demonstrated more than 150 other subsystems that are overrepresented in the Sargasso Sea sample when compared to the control set. The skew in the database alone cannot explain this difference, and these subsystems must be important for survival in the ocean. For some examples, such as the synthesis of serine, threonine, and glycine, directly testable hypotheses can be generated from these analyses. For other examples, the explanation of the differences between samples may be more elusive. Several pieces of evidence will assist in determining the roles of different subsystems in different environments. For example, the inclusion of more environmental data with each sample will help to explain some of the differences in metabolic potential between samples; the careful dissection of the presence of different subsystems in different organisms will identify which organism in which environment is performing the different biochemical reactions; and the extension of other techniques such as metabolic modelling into the environmental arena may provide insights into the critical biochemical mechanisms in each environment.
Serine, threonine, and glycine betaine are primarily being used as osmoprotectants. Increased intracellular concentrations of serine may protect against the osmolarity of the ocean. In contrast, sucrose and trehalose are not being used as readily for osmoprotection.
Microbes in the Sargasso Sea are more limited for phosphate than nitrogen.
Microbes in the ocean are not generally using flagella-based motility but are probably using one of the less-characterized mechanisms of locomotion.
Archaea in the AMD sample are degrading arginine and histidine.
The subsystems approach to investigating environmental genomes demonstrates the intricate interplay between the abundance of genes in the environment and the biology of that environment. In addition to answering that age-old knock-knock joke by cataloging the organisms that are present in an environment and looking for novel proteins and structures, metagenomics also provides critical insights into our understanding of the physiology, biology, and ecology of an environment. The use of subsystems to compare the ecology of sites that have been sampled by metagenomics can be applied to any other metagenome samples to provide similar insights into the ecology of those environments.
The complete SEED database v4 was used as the source of all data. Construction and annotation of the subsystem database is described elsewhere [9, 28]. The environmental sequences were removed from the SEED database for the analyses presented here. Furthermore, any sequences with principal homology to either Shewanella sp. or Burkholderia sp. were removed from the Sargasso Sea metagenome because of contamination concerns. The resulting Sargasso Sea data set contained 960,561 predicted proteins. The AMD data set contained 7,588 predicted proteins. For these analyses the term "protein" is used when referring to predicted proteins.
Assignment of proteins to subsystems and phylosubsystems
Each protein from the AMD, Sargasso, or SEED database was compared to proteins in the SEED database previously assigned to particular subsystems by the SEED annotators. A protein was considered a member of a subsystem if it had significant similarity (designated as an E value less than 1 × 10⁻²⁰) to another protein previously assigned to that subsystem. Each protein was also classified as Bacteria, Eukarya, or Archaea, based on the Domain assignment of the most similar protein. There were a total of 276 annotated subsystems in the SEED. Bacteria had proteins in 257 of the subsystems. Archaea and Eukarya had proteins in 132 and 134 of the subsystems, respectively. This means that there were a total of 523 potential data points (257 + 132 + 134). The term phylosubsystem is used to reflect the fact that the assignments are based on both the subsystem and the Domain.
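The assignment rule can be sketched as follows. This is a simplified illustration rather than the authors' pipeline: it assumes similarity search results are already available as records with a query identifier, the subsystem and Domain of the matched SEED protein, and an E value, and it keeps only the single most similar hit per query. The `Hit` record and the function `assign_phylosubsystems` are hypothetical names.

```python
from typing import NamedTuple

E_VALUE_CUTOFF = 1e-20  # significance threshold stated in the text


class Hit(NamedTuple):
    query_id: str   # protein being classified
    subsystem: str  # subsystem of the matched SEED protein
    domain: str     # "Bacteria", "Archaea", or "Eukarya" of the matched protein
    e_value: float


def assign_phylosubsystems(hits, cutoff=E_VALUE_CUTOFF):
    """Map each query to the (subsystem, domain) of its best hit below the cutoff."""
    best = {}
    for hit in hits:
        if hit.e_value >= cutoff:
            continue  # not significant: E value must be strictly less than the cutoff
        current = best.get(hit.query_id)
        if current is None or hit.e_value < current.e_value:
            best[hit.query_id] = hit
    return {query: (h.subsystem, h.domain) for query, h in best.items()}
```

Queries without any hit below the cutoff are left unassigned, so they do not contribute to any phylosubsystem count.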
Comparisons of metagenomic databases
Software to calculate the statistics
A software package that calculates the statistics from appropriately formatted files is provided as supplemental material [see Additional File 1]. The software is released under the GPL license and is available from the BMC website.
The authors thank Matt Cohoon, Sveta Gerdes, Andrei Osterman, Ross Overbeek, Rick Stevens, Olga Vassieva, Veronika Vonstein, Olga Zagnitko, and the other developers and annotators working on the SEED for their invaluable contributions. The authors also thank Mya Breitbart, Ed DeLong, Stanley Maloy, Braudel Maqueira, Ross Overbeek, and Anca Segall, for helpful discussion and critical reading of the manuscript. The Gordon and Betty Moore Foundation Marine Microbiology Initiative grant to FLR and the NSF Biocomplexity Program (NSF0221763) to John Paul (University of Southern Florida) and Anca Segall (San Diego State University) funded this work.
1. Riesenfeld CS, Schloss PD, Handelsman J: Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 2004, 38:525–552. doi:10.1146/annurev.genet.38.072902.091216
2. Breitbart M, Felts B, Kelley S, Mahaffy JM, Nulton J, Salamon P, Rohwer F: Diversity and population structure of a near-shore marine-sediment viral community. Proc R Soc Lond B Biol Sci 2004, 271(1539):565–574. doi:10.1098/rspb.2003.2628
3. Breitbart M, Hewson I, Felts B, Mahaffy JM, Nulton J, Salamon P, Rohwer F: Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol 2003, 185(20):6220–6223. doi:10.1128/JB.185.20.6220-6223.2003
4. Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, Mead D, Azam F, Rohwer F: Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci U S A 2002, 99(22):14250–14255. doi:10.1073/pnas.202488399
5. Cann AJ, Fandrich SE, Heaphy S: Analysis of the virus population present in equine faeces indicates the presence of hundreds of uncharacterized virus genomes. Virus Genes 2005, 30(2):151–156. doi:10.1007/s11262-004-5624-3
6. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM: Comparative metagenomics of microbial communities. Science 2005, 308(5721):554–557. doi:10.1126/science.1107851
7. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO: Environmental genome shotgun sequencing of the Sargasso Sea. Science 2004, 304(5667):66–74. doi:10.1126/science.1093857
8. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 2004, 428(6978):37–43. doi:10.1038/nature02340
9. Overbeek R, Begley T, Butler R, Choudhuri J, Chuang H, Cohoon M, de Crécy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank E, Gerdes S, Glass E, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy A, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch G, Rodionov D, Rückert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005.
10. Keil RG, Kirchman DL: Utilization of dissolved protein and amino acids in the northern Sargasso Sea. Aquatic Microbial Ecology 1999, 18(3):293–300.
11. Suttle CA, Chan AM, Fuhrman JA: Dissolved free amino-acids in the Sargasso Sea - uptake and respiration rates, turnover times, and concentrations. Marine Ecology-Progress Series 1991, 70(2):189–199.
12. Oren A: Diversity of halophilic microorganisms: Environments, phylogeny, physiology, and applications. Journal Of Industrial Microbiology & Biotechnology 2002, 28(1):56–63.
13. Roesser M, Muller V: Osmoadaptation in bacteria and archaea: common principles and differences. Environ Microbiol 2001, 3(12):743–754. doi:10.1046/j.1462-2920.2001.00252.x
14. Sleator RD, Hill C: Bacterial osmoadaptation: the role of osmolytes in bacterial stress and virulence. Fems Microbiology Reviews 2002, 26(1):49–71. doi:10.1111/j.1574-6976.2002.tb00598.x
15. Yancey PH, Clark ME, Hand SC, Bowlus RD, Somero GN: Living with water stress: evolution of osmolyte systems. Science 1982, 217(4566):1214–1222.
16. Galinski EA: Osmoadaptation in bacteria. Adv Microb Physiol 1995, 37:272–328.
17. Mackay MA, Norton RS, Borowitzka LJ: Organic osmoregulatory solutes in cyanobacteria. Journal Of General Microbiology 1984, 130(SEP):2177–2191.
18. Deiwick J, Nikolaus T, Erdogan S, Hensel M: Environmental regulation of Salmonella pathogenicity island 2 gene expression. Mol Microbiol 1999, 31(6):1759–1773. doi:10.1046/j.1365-2958.1999.01312.x
19. Wu JF, Sunda W, Boyle EA, Karl DM: Phosphate depletion in the western North Atlantic Ocean. Science 2000, 289(5480):759–762. doi:10.1126/science.289.5480.759
20. Grossart HP, Riemann L, Azam F: Bacterial motility in the sea and its ecological implications. Aquatic Microbial Ecology 2001, 25(3):247–258.
21. Mitchell JG, Pearson L, Bonazinga A, Dillon S, Khouri H, Paxinos R: Long lag times and high velocities in the motility of natural assemblages of marine-bacteria. Applied And Environmental Microbiology 1995, 61(3):877–882.
22. McCarren J, Heuser J, Roth R, Yamada N, Martone M, Brahamsha B: Inactivation of swmA results in the loss of an outer cell layer in a swimming Synechococcus strain. J Bacteriol 2005, 187(1):224–230. doi:10.1128/JB.187.1.224-230.2005
23. Waterbury JB, Willey JM, Franks DG, Valois FW, Watson SW: A cyanobacterium capable of swimming motility. Science 1985, 230(4721):74–76.
24. Henrichsen J: The occurrence of twitching motility among gram-negative bacteria. Acta Pathol Microbiol Scand [B] 1975, 83(3):171–178.
25. Supplemental Data on the String Website [http://string.embl.de/metagenome_comp_suppl/keggmap.detection.frequencies.txt]
26. Oremland RS, Capone DG, Stolz JF, Fuhrman J: Whither or wither geomicrobiology in the era of 'community metagenomics'. Nat Rev Microbiol 2005, 3(7):572–578. doi:10.1038/nrmicro1182
27. The SEED [http://theseed.uchicago.edu/FIG/index.cgi]
28. Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J, Kalinowski J, Linke B, Rupp O, Giegerich R, Puhler A: GenDB--an open source genome annotation system for prokaryote genomes. Nucleic Acids Res 2003, 31(8):2187–2195. doi:10.1093/nar/gkg312
29. DeLong EF: Microbial community genomics in the ocean. Nat Rev Microbiol 2005, 3(6):459–469. doi:10.1038/nrmicro1158
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.