Combining gene prediction methods to improve metagenomic gene annotation
© Yok and Rosen; licensee BioMed Central Ltd. 2011
Received: 20 July 2010
Accepted: 13 January 2011
Published: 13 January 2011
Traditional gene annotation methods rely on characteristics that may not be available in short reads generated from next generation technology, resulting in suboptimal performance for metagenomic (environmental) samples. Therefore, in recent years, new programs have been developed that optimize performance on short reads. In this work, we benchmark three metagenomic gene prediction programs and combine their predictions to improve metagenomic read gene annotation.
We not only analyze the programs' performance at different read-lengths like similar studies, but also separate different types of reads, including intra- and intergenic regions, for analysis. The main deficiencies are in the algorithms' ability to predict non-coding regions and gene edges, resulting in more false-positives and false-negatives than desired. In fact, the specificities of the algorithms are notably worse than the sensitivities. By combining the programs' predictions, we show significant improvement in specificity at minimal cost to sensitivity, resulting in 4% improvement in accuracy for 100 bp reads with ~1% improvement in accuracy for 200 bp reads and above. To correctly annotate the start and stop of the genes, we find that a consensus of all the predictors performs best for shorter read lengths while a unanimous agreement is better for longer read lengths, boosting annotation accuracy by 1-8%. We also demonstrate use of the classifier combinations on a real dataset.
To optimize the performance for both prediction and annotation accuracies, we conclude that the consensus of all methods (or a majority vote) is the best for reads 400 bp and shorter, while using the intersection of GeneMark and Orphelia predictions is the best for reads 500 bp and longer. We demonstrate that most methods predict over 80% coding (including partially coding) reads on a real human gut sample sequenced by Illumina technology.
Analysis of environmental samples, metagenomic analysis, is defined as the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation . Environmental samples are sequenced using next-generation sequencing technologies which yield short reads lengths (100-500 base pairs) . In traditional analysis, the whole genome of an organism is sequenced and assembled, then genes are predicted along this continuous sequence. In metagenomics, single genomes cannot be assembled and therefore, it is a challenge to predict genes within these short sequences . Accurate gene annotation for environmental samples is needed so that genes can be classified to their correct functions, and it paves the way for functional studies in metagenomics.
Traditionally gene prediction programs can be categorized in two different groups. First, the ab initio programs, which train model parameters on known annotations in order to predict unknown annotations, are widely used in gene prediction . There are a large number of ab-initio gene-finding programs, e.g.: GENIE , GENSCAN , GENEID , GLIMMER , and GeneMark . The second group of gene prediction programs, known as homology-based programs, that align input sequences to the closest homologous sequence in the database to predict genes. Some popular homology-based programs are GENEWISE , AGenDA , and the well-known BLAST . In addition, hybrid approaches that combine the gene prediction programs have been proposed for traditional gene annotation [13–16]. Unfortunately, it is not possible to use traditional gene prediction methods in metagenomics. Applying these conventional approaches to metagenomics is restricted by the identification of open reading frames (ORFs), which begin with a start codon and end with an in-frame stop codon . Usually, genes in prokayrotes are 1000-bp in length on average . Due to the short sequence length of metagenomic reads (under 500-bp from next-generation technology), they contain incomplete ORFs that lack start and/or stop codons, thus conventional ab-initio programs cannot be applied to metagenomics . Similarly, homology-based approaches for gene predictions rely on databases that only contain known, and thus a limited set, of genes. Therefore, both of these categories do not work well for metagenomic fragments which are about 700 bp/less than 400 bp when produced by Sanger and next generation sequencing, respectively .
Therefore, recent tools have emerged to address these problems for metagenomic reads. Three programs are widely used for this purpose: Orphelia , MetaGene(MG) /MetaGeneAnnotator(MGA) , and GeneMark . This paper will benchmark and compare these methods. Then we demonstrate that we can boost specificity drastically by 10% by combining the programs' predictions and overall, improve accuracy by 1-4% while also improving annotation (labeling the start and stop of coding regions) by 1-8%.
Overview of Metagenomic Gene Prediction Programs
GeneMark (GM) 
Obtain the relationships between positional nucleotide frequencies and the global nucleotide frequencies as well as relationships between the amino acid frequencies and the global GC% of the training sequences.
Approximate the obtained relationships by the standard linear regression
Obtain the initial values of frequency of occurrence of each of the 61 codons by calculating the products of the three positional nucleotide frequencies of corresponding nucleotides.
Modify the initial value of codon frequency by the frequency of each amino acid determined by the GC content.
Create a codon usage table for all 61 codons.
Construct the 3-periodic zero order Markov model of a protein coding region using the codon usage table.
The heuristic model was built using these procedures described by using a training data that consists of 357 Bacteria and Archaea species . In order to build a mixture dual model, they are further divided into two sets: (1) 38 Archaea species and 319 Bacteria species; (2) 316 Mesophilic species and 41 Thermophilic species .
Metagene Annotator (MGA) is an upgrade version of another software package, called MetaGene (MG) which is used in gene prediction in metagenomic sequence data. MetaGene predicts genes in two stages. First, all possible ORFs are revealed from the input sequences. Next, all ORFs are scored by their base compositions and lengths using the log-odds scoring scheme. In the log-odds scoring, the frequency of an event observed in ORFs is divided by the observed frequency in random ORFs, and a base-two logarithm of the ratio is used as the score for the event . Second, an optimal (high-scored) combination of ORFs is calculated using the scores of orientations and distances of neighboring ORFs in addition to the scores for the ORFs themselves . However, there are two major limitations that exist in (MG) software program: the lack of ribosomal binding site (RBS) model, and a low sensitivity to atypical genes, whose codon usages are different from those of typical genes . To overcome these limitations and to improve the usability of the program, a new version of the MG called the (MGA) was developed . The MGA has statistical models of prophage genes that enables it to detect lateral gene transfers or phage infections. The MGA also has an adaptable RBS model based on complementary sequences of the 30 tail of 16 S ribosomal RNA which helps it to precisely predict translation starts of genes even when input genomic sequences are short and anonymous sequences. Since the MGA is based on the algorithm of MG, it has logistic regression models of the GC content and the di-codon frequencies (di-codon models) of MG . These features of the MGA remarkably improve prediction accuracies of genes on a wide range of prokaryotic genomes.
Orphelia (Orph) 
Orphelia is a metagenomic ORF finding program for the prediction of protein coding genes in short fragments of DNA sequences with unknown phylogenetic origin. The Orpehelia prediction engine performs gene-finding in two stages. In the first stage, features for monocodon usage, dicodon usage and translation initiation sites are extracted from the ORF sequence using linear discriminants. In the second stage, an artificial neural network combines the sequence features with ORF length and fragment GC-content, and computes a posterior probability of an ORF to encode a protein. Its neural network is trained on randomly excised DNA fragments of a specified length from the genomes that were used for discriminant training.
To improve metagenomic gene prediction and annotation, we first analyze the three leading metagenomic gene prediction programs' sensitivity and specificity to predict whether a read contains a gene. Then, we analyze the upper-bound prediction error of the algorithms, which quantifies the error when all prediction programs mark the read incorrectly. The upper-bound error is much lower than the individual methods, demonstrating that improvements can be made if we combine the predictors. Finally, we analyze different ways of combining the prediction programs to improve prediction accuracy, sensitivity, specificity, and annotation accuracy.
Benchmarking the three gene prediction programs
In conclusion, we note that MGA has the best sensitivity but has the worst specificity for most read lengths. GeneMark has average sensitivity but outstanding specificity for most read lengths, which gives it a better overall f-measure for most read lengths.
Combining the classifiers to improve prediction
We also investigated how the sensitivities and specificities vary for different fragment types. All programs have relatively good sensitivities for type B fragments. We compare the three programs in the supplementary material, in Figures Additional file 3 Additional file 4 and Additional file 5 and show that MGA has the highest sensitivities while Orphelia has the lowest sensitivities for types A, B and C. On the other hand, GeneMark has the best specificities among the three programs for all fragment types and read-lengths, while MGA has the lowest specificities, shown in the supplementary material in Figure Additional file 6. In order to mitigate weaknesses of these programs, we implement the Boolean logical combinations of them to combine the sensitivity vs. specificity trade-off and the Figures also show that the logical combination of GM & Orph has the best specificities. The logical combination of these classifiers shows promising results which we further investigate to find the best performance.
AUC table for ROC analysis
GM & Orph
Combining the classifiers to improve gene annotation
Suggested method (optimizes prediction accuracy and annotation error) vs. Read-length, where annotation accuracy = 100 - annotation error
Best Method (Prediction Accuracy/Annotation Accuracy)
Demonstration on a real dataset, the Human Twin Lean Gut Data
For this analysis, we chose to compare against the best classifier combinations to predict coding regions in 100 bp reads - the Consensus combination and the GM|MGA|Orph. We can see that the GM|MGA|Orph combination method produces the highest gene/non-gene ratio (88% Type A/B/C and 12% for Type D). Secondly, the consensus method which was shown to have the best annotation accuracy, predicts 79% of the sequences as genes, and Orphelia falls between these two predictions with an 84% gene percentage. While the GM|MGA|Orph finds a similar coding/non-coding ratio found in the typical microbial genome [26, 27], there is not sufficient evidence to show that a typical metagenome will represent this ratio. In the future, we plan to examine the coding/non-coding content of metagenomic samples in varying environments.
While this is beyond the scope of this paper, the next step would be to characterize and validate the extra genes discovered using the classifier combination. We propose that these coding regions may have characteristics which make them difficult to identify and may be of potential interest.
We show that performances of programs, GeneMark, MetaGeneAnnotator, and Orphelia, vary for different read lengths and fragment types. The different algorithms result in a trade-off of sensitivity vs. specificity and a gradual decline in these rates for shorter reads. GeneMark's sensitivity and prediction accuracy are lower than those of Orphelia and MGA, while its average specificities are the highest for most read lengths. This is due to GeneMark's ability to correctly predict Type D fragments as non-coding. Also, GeneMark has the lowest annotation error, meaning it is the best in predicting the start and stops of genes, for short read lengths while Orphelia has the lowest annotation error for longer read lengths.
We show that we can improve on these trade-offs by combining the methods' predictions and annotations. In general, the intersection of the methods improves annotation accuracies but at the cost of poor prediction accuracies, while the union of the methods improves predictions accuracies at the cost of poor annotation. We validate the GeneMark, MGA, Orphelia, and the best combinations on a human gut sample sequenced by Illumina technology, and find that GM|MGA|Orph and Orphelia produce the highest coding/non-coding ratios, though more investigations are needed to determine the gene content of metagenomes. In conclusion, the consensus logical combination, or majority vote, has the best performance (optimizing prediction and annotation accuracy) for 100-400 bp while GM&Orphelia has the best performance for 500 bp and longer.
In this work, we benchmark the performance of the three different gene prediction programs: GeneMark, MGA, and Orphelia. To do so, we simulate a dataset composed of coding, non-coding, and partially-coding metagenomic reads. Then, we describe several metrics used to compare the predictors and the parameters input into the programs. Third, we implement Boolean logical combinations of the prediction programs in order to combine their predictions to improve accuracy.
Simulating the Metagenomic Reads
We simulated 2 samples of 28,000 artificial metagenomic fragments from 96 genomes to obtain an average and standard deviation of the figures presented in the paper. The 96 genomes, names are available in Additional file 11 consists of 19 different phyla and represents 14 Archaea species and 70 Bacterial species. If available, 7 species were randomly selected from each phyla, otherwise all the example strains were taken from the phyla. We simulated 4000 reads for each read length (100 bp, 200 bp, 300 bp, 400 bp, 500 bp, 600 bp, and 700 bp) in order to mimic a variety of sequencing technologies. For each read length, we simulated 1000 reads of each of the four different fragment types: Type A) a fragment that contains half of an upstream region and half coding region, Type B) a fragment that contains a fully coding region, Type C) a fragment that half contains the end of a coding region and contains half of a non-coding region, and Type D) a fragment that is fully a non-coding region. The coding/noncoding portions of the fragments were designed using Genbank annotations, which has been known to have errors , but that should not affect the overall rates. We made sure that the upstream and downstream regions in the annotations are purely non-coding regions and have no Genbank gene annotation.
We would like the reader to note that two types of metrics are used to assess the gene prediction programs, and this improves our study compared to previous analyses. First, the sensitivity, specificity, and f-measure is used to determine how well the algorithms predict whether or not there is a gene fragment within a read. Since sensitivity and specificity analyses require a binary detection event, we compute a true positive as detecting a gene anywhere in the read where a gene is expected, a false positive if a gene was predicted in a fully non-coding read, a false negative if a read that contained a gene is predicted as fully non-coding, and true negative as a fully non-coding read that is predicted as non-coding. Therefore, any read that contains a Type A, B, or C fragment should ideally be predicted as a gene. In our analysis, 3000 coding regions are used and 1000 noncoding regions are used. Secondly, in order to determine how well the programs annotate the start and stop of a gene, especially in Type A and C (gene edge) reads, we introduce a new measure, annotation error. The computation of annotation error compares the predicted gene coordinates to the reference gene coordinates, and annotation error is expressed as a percentage which measures how well the programs correctly predict the start/stop (correct annotation) of the gene.
FP, stands for false positive, which is a predicted gene that does not correspond to any gene in the GenBank. And TN stands for true negative, which is non-coding region according to the GenBank and is confirmed to be true by the program.
where Lp stands for the left end index of the gene annotation of the software program. Lgb stands for the left end index of the GenBank's annotation. Rp stands for the right end index of the fragment annotation of the program, while Rgb stands for the right end index of the fragment annotation of the GenBank. And |Fgb| stands for the fragment length according to the GenBank annotation.
Parameters of the Programs
We benchmarked each program in May of 2010. There are required options on web submission windows for some of these programs to give gene prediction. In GeneMark's submission window we used the following setup to examine our samples: For the option of "Kingdom" given on its web page, we used "Mixture of bacteria and archaea". This option should be used for all environmental samples as well as for human and other microbiomes. The option allows for using bacterial and archaeal heuristic models concurrently. It also help in achieving high sensitivity with somewhat lower specificity. For the second option, "Model" we used "Codon Polynomial fitting order 3" for this option no temperature parameters are needed. In the output options we selected "nucleotide" and "HMM" .
Orphelia provides two models for scoring open reading frames in metagenomic reads, Net700 and Net300. Orphelia currently provides two models for scoring open reading frames in sequence fragments, Net700 and Net300. In order to run the web-server, the user needs to specify, which model should be used to predict genes in the input data based on the length of the input fragments. As recommended, we use Net300 for reads that are < = 300 bp and use Net700 for 400bp and longer reads.
MGA does not have required options to run its gene prediction engine. This can be an advantage to the novice user.
Boolean logical combinations of the classifiers
To combine the classifiers, we form nine logical combinations of the three (where & is the AND or Intersection operation and | is the OR or UNION operation):
This work was supported by the National Science Foundation CAREER award #0845827, Department of Energy award DE-SC0004335, and a Department of Education GAANN: Engineering for Pharmaceutical Applications.
- Handelsman J: Committee on Metagenomics: Challenges and Functional Applications. The National Academies Press; 2007.Google Scholar
- Metzker ML: Sequencing technologies - the next generation. Nature Reviews Genetics 2010, 11: 31–46. 10.1038/nrg2626View ArticlePubMedGoogle Scholar
- Hoff KJ, Lingner T, Meinicke P, Tech M: Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Research 2009, (37 Web Server):W101-W105. 10.1093/nar/gkp327
- Stanke M, Waack S: Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 2003, 19(Suppl 2):215–225.View ArticleGoogle Scholar
- Reese MG, Kulp D, Tammana H, Haussler D: Genie-Gene Finding in Drosophila melanogaster. Genome Res 2000, 10(4):529–538. 10.1101/gr.10.4.529PubMed CentralView ArticlePubMedGoogle Scholar
- Burge CB: Identification of genes in human genomic DNA. PhD thesis. Stanford University, Stanford, CA, USA; 1997.Google Scholar
- Parra G, Blanco E, Guigo R: GeneID in Drosophila. Genome Research 2000, 10: 511–515. 10.1101/gr.10.4.511PubMed CentralView ArticlePubMedGoogle Scholar
- Delcher A, Bratke K, Powers E, Salzberg S: Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 2007, 23(6):673–379. 10.1093/bioinformatics/btm009PubMed CentralView ArticlePubMedGoogle Scholar
- Besemer J, Borodovsky M: GeneMarkS: a self training method for prediction of gene starts in microbial genome implication for finding sequence motifs in regulatory regions. Nucleic Acids Res 2001, 29(12):2607–18. 10.1093/nar/29.12.2607PubMed CentralView ArticlePubMedGoogle Scholar
- Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res 2004, 14: 988–995. 10.1101/gr.1865504PubMed CentralView ArticlePubMedGoogle Scholar
- Taher L, Rinner O, Garg S, Sczyrba A, Brudno M, Batzoglou S, Morgenstern B: AGenDA: homology-based gene prediction. Bioinformatics 2003, 19(12):1575–1577. 10.1093/bioinformatics/btg181View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215: 403–410.View ArticlePubMedGoogle Scholar
- Pavlovic V, Garg A, Kasif S: A bayesian framework for combining gene predictions. Bioinformatics 2002, 18: 19–27. 10.1093/bioinformatics/18.1.19View ArticlePubMedGoogle Scholar
- Yada T, Totoki Y, Takaeda Y, Sakaki Y, Takagi T: DIGIT: a novel gene finding program by combining gene-finders. Pacific Symposium on Biocomputing 2003, 8: 375–387.Google Scholar
- Shah SP, McVicker GP, Mackworth AK, Rogic S, Ouellette BFF: Genecomber: combining outputs of gene prediction programs for improved results. Bioinformatics 2003, 19: 1296–1297. 10.1093/bioinformatics/btg139View ArticlePubMedGoogle Scholar
- Allen JE, Majoros WH, Pertea M, Salzberg SL: JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 2006, 7(Suppl 1):S9. 10.1186/gb-2006-7-s1-s9PubMed CentralView ArticlePubMedGoogle Scholar
- Xu L, Chen H, Hu X, Zhang R, Zhang Z, Luo ZW: Average Gene Length Is Highly Conserved in Prokaryotes and Eukaryotes and Diverges Only Between the Two Kingdoms. Molecular Biology and Evolution 2006, 23(6):1107–1108. 10.1093/molbev/msk019View ArticlePubMedGoogle Scholar
- Mardis ER: The impact of next-generation sequencing technology on genetics. Trends in Genetics 2008, 24: 133–141.View ArticlePubMedGoogle Scholar
- Noguchi H, Park J, Takagi T: MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res 2006, 34(19):5623–5630. 10.1093/nar/gkl723PubMed CentralView ArticlePubMedGoogle Scholar
- Noguchi H, Taniguchi T, Itoh T: MetaGeneAnnotator: Detecting Species-Specific Patterns of Ribosomal Binding Site for Precise Gene Prediction in Anonymous Prokaryotic and Phage Genomes. DNA Research 2008, 15(6):387–396. 10.1093/dnares/dsn027PubMed CentralView ArticlePubMedGoogle Scholar
- Besemer J, Borodovsky M: Heuristic approach to deriving models for gene finding. Nucleic Acids Res 1999, 27(19):3911–3920. 10.1093/nar/27.19.3911PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu W, Lomsadze A, Borodovsky M: Ab initio gene identification in metagenomic sequences. Nucleic Acids Research 2010, 38(12):e132. 10.1093/nar/gkq275PubMed CentralView ArticlePubMedGoogle Scholar
- Hoff KJ, Tech M, Lingner T, Daniel R, Morgenstern B, Meinicke P: Gene prediction in metagenomic fragments: A large scale machine learning approach. BMC Bioinformatics 2008., 9(217):Google Scholar
- Yok N, Rosen G: Benchmarking of Gene Prediction Programs for Metagenomic Data. 32nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2010, 4.Google Scholar
- Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, Egholm M, Hensirrat B, Heath AC, Knight R, Gordon JI: A core gut microbiome in obese and lean twins. Nature 2009, 457: 480–484. 10.1038/nature07540PubMed CentralView ArticlePubMedGoogle Scholar
- Taft RJ, Pheasant M, Mattick JS: The relationship between non-protein-coding DNA and eukaryotic complexity. Bioessays 2007, 29: 288. 10.1002/bies.20544View ArticlePubMedGoogle Scholar
- Ahnert SE, Fink TMA, Zinovyev A: How much non-coding DNA do eukaryotes require? Journal of Theoretical Biology 2008, 252(4):587–592. 10.1016/j.jtbi.2008.02.005View ArticlePubMedGoogle Scholar
- Kuncheva L: Combining Pattern Classifiers. Methods and Algorithms. Wiley; 2004.View ArticleGoogle Scholar
- Polikar R: Bootstrap inspired techniques in computational intelligence: ensemble of classifiers, incremental learning, data fusion and missing features. IEEE Signal Processing Magazine 2007, 24: 59–72. 10.1109/MSP.2007.4286565View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.