Gene prediction in metagenomic fragments based on the SVM algorithm
© Liu et al.; licensee BioMed Central Ltd. 2013
Published: 10 April 2013
Metagenomic sequencing is becoming a powerful technology for exploring microorganisms from various environments, such as the human body, without isolation and cultivation. Accurately identifying genes from metagenomic fragments is one of the most fundamental issues.
In this article, we present a novel gene prediction method named MetaGUN for metagenomic fragments based on a machine learning approach, the support vector machine (SVM). It predicts genes with a three-stage strategy. Firstly, it classifies input fragments into phylogenetic groups by a k-mer based sequence binning method. Then, protein-coding sequences are identified for each group independently with SVM classifiers that integrate entropy density profiles (EDP) of codon usage, translation initiation site (TIS) scores and open reading frame (ORF) length as input patterns. Finally, the TISs are adjusted by employing a modified version of MetaTISA. To identify protein-coding sequences, MetaGUN builds a universal module and a novel module. The former is based on a set of representative species, while the latter is designed to find potential functional DNA sequences with conserved domains.
Comparisons on artificial shotgun fragments with multiple current metagenomic gene finders show that MetaGUN gives better predictions at both the 3' and 5' ends of genes for fragments of various lengths. In particular, it makes the most reliable predictions among these methods. As an application, MetaGUN was used to predict genes for two samples of human gut microbiome. It identifies thousands of additional genes with significant evidence. Further analysis indicates that MetaGUN tends to predict more potential novel genes than other current metagenomic gene finders.
Thousands of prokaryotes have been cultivated and sequenced to explore the extent of biological diversity of the microbial world. However, studies based on 16S ribosomal RNA approaches estimate that only a small fraction of living microbes can be easily isolated and cultivated under laboratory conditions, so single genome sequencing is not applicable to the majority of microbial species [2, 3]. This means that current genomic data are highly biased and do not represent the true picture of the microbial species. In addition, single genome sequencing ignores interactions such as coevolution and competition between organisms living in the same habitat, and therefore fails to reveal the real state of microbial organisms in nature.
These limitations can be circumvented by metagenomics, a methodology for studying microbial communities by directly sampling and sequencing shotgun DNA fragments from their natural environments without prior cultivation. It is becoming a powerful method to reveal genomic sequences from organisms in natural environments, especially for communities residing in or on human bodies that are closely related to human health. With the development of sequencing technologies, DNA sequences can be produced at much higher throughput and much lower cost than before. So far, hundreds of samples from various environments, such as acid mine drainage, the Sargasso Sea, Minnesota soil and the human gut microbiome [9–11], have been sequenced by traditional Sanger sequencing and next-generation sequencing (NGS) technologies like Roche 454 and Illumina.
Accurate gene prediction is one of the fundamental steps in all metagenomic sequencing projects. However, it is more complicated in metagenomes than in isolated genomes. Firstly, most fragments are very short. Many sequences in metagenomic sequencing projects remain as unassembled singleton reads or short contigs. Therefore, many genes are incomplete, with one or both ends extending beyond the fragment, which is not a problem in complete genomes. Also, because a single fragment usually contains only one or two genes, unsupervised methods for single genomes, which require an adequate number of genes for model training, are inapplicable in this situation. Secondly, the anonymous sequence problem, meaning that the source genomes of the fragments are usually unknown or totally new [13, 14], poses a challenge for statistical model construction and feature selection.
Two types of approaches are commonly used for predicting genes from metagenomic DNA fragments. One is the evidence-based method that relies on homology searches. It includes comparisons against known protein databases by BLAST packages, CRITICA and Orpheus. Usually, it can infer functionalities and metabolic pathways of the predicted genes via significant hits, with high specificity if the threshold is stringent. However, only genes with previously known homologs can be predicted by this means, while novel genes, which are very important to metagenomic studies, will be overlooked. Therefore, ab initio algorithms that offer much higher sensitivity along with sufficiently high specificity are indispensable.
Despite the anonymous and short fragmentary nature of sequences, several ab initio methods have been specially designed for metagenomic fragments in recent years [12–14, 17–20], reporting that the performance on the 3' end of genes is comparable with that on single genomes. Most of these previous methods are based on modeling sequences in a Markov architecture of various orders. For example, MetaGeneMark incorporates a hidden Markov model to depict the dependencies between the frequencies of oligonucleotides of different lengths and the GC% of a nucleotide sequence by using direct polynomial and logistic approximations. It was found that the fifth-order Markov model obtained by logistic regression of hexamer frequencies performs the best. Glimmer-MG was developed based on the Glimmer framework, which uses variable-order interpolated Markov models to capture sequence compositions of protein-coding genes. Orphelia is a recently proposed metagenomic gene finder based on a machine learning approach that bypasses the Markov model. It integrates mono-codon and di-codon usage, sequence patterns around TISs, ORF length and GC content into an artificial neural network to estimate the probability that an ORF is protein-coding.
To overcome the anonymous sequence problem, MetaGene and MetaGeneMark train separate models for Archaea and Bacteria, as studies have shown that the dependency patterns of oligonucleotides on GC content differ between the two domains of life [12, 19]. An incoming fragment is predicted by both models and the one with the higher score is chosen. In MetaProdigal, current complete genomes are first classified into 50 clusters according to the gene prediction similarity of Prodigal training files. Then, these clusters are used for learning another 50 training files for gene prediction in metagenomic fragments. A given fragment is scored by the training files within a range of its GC content. Glimmer-MG reported that integrating sophisticated classification and clustering schemes based on interpolated Markov models to parameterize gene prediction models produces much better results than using GC content. In one of our previous works, MetaTISA introduced a k-mer method for binning sequences before TIS relocation, which also achieves substantial improvement in TIS prediction.
In this article, we present a novel gene prediction method, MetaGUN, for metagenomic fragments based on a machine learning approach, the support vector machine (SVM). Three sets of statistics are integrated to depict the coding potential of a candidate ORF: the EDP of codon usage, the TIS scores and the ORF length. The triplet nucleotide pattern is one of the most important statistical properties for discriminating protein-coding sequences from non-coding DNA. Unlike most current metagenomic gene finders, MetaGUN describes the codon usage of ORFs by using an EDP model instead of a Markov model. The EDP model was used to measure the coding potential of ORFs based on amino acid usage for single genomes in our previous works [22, 23]. Here, the EDP model is extended to be based on codon usage for metagenomic fragments.
Sequence patterns around TISs are also important signatures that can improve gene prediction performance [13, 18, 23]. In this work, we implement a TIS scoring strategy based on hundreds of precomputed TIS parameters trained by the TriTISA program to obtain the TIS scores for a given ORF. The length of an ORF is the third integrated feature, which has been reported to be another important measure for distinguishing genes from random ORFs in both isolated genomes and metagenomes. Recently, some metagenomic gene finders have made special efforts to predict correct TISs, with substantial achievements [13, 14]. In MetaGUN, an upgraded version of MetaTISA is employed for adjusting the TISs of predicted genes. To identify protein-coding sequences, MetaGUN builds two gene prediction modules, the universal module and the novel module. The former is based on 261 prokaryotic genomes representatively covering a wide range of phylogenetic clades, genomic GC content and varied living environments. The latter is designed to find potential functional DNA sequences with conserved domains.
MetaGUN is freely available as open-source software from http://bioinfo.ctb.pku.edu.cn/MetaGUN/ under the GNU GPL license.
Materials and methods
Genomic data and annotations of 261 complete genomes (229 bacteria and 32 archaea) were obtained from the NCBI RefSeq database for training the supervised SVM classifiers and the fragment classification model. Twelve species (9 bacteria and 3 archaea) used in previous methods are also chosen for evaluating the prediction performance here [12, 18]. Since the genomes of the 12 species are included in the training set, we excluded them from the training data when assessing the performance on these genomes. The 6 genomes with experimentally characterized gene starts are used for evaluating TIS accuracy. Two samples of human gut microbiome are used for investigating the novel gene discovery ability of current methods. Their genomic sequences and corresponding annotations were obtained from the IMG/M website.
Architecture of MetaGUN algorithm
To predict genes, MetaGUN runs in three stages. Firstly, a k-mer based naïve Bayesian sequence binning method is employed to assign all incoming fragments to phylogenetic groups, as in our previous work MetaTISA. In MetaGUN, fragments are assigned at both the genus level and the domain level (Archaea and Bacteria). The former is used for selecting supervised TIS scoring parameters and for TIS prediction, and the latter is used to determine the SVM classifiers for gene prediction. Secondly, all possible ORFs (complete and incomplete) are extracted from the fragments and scored by their feature vectors with the SVM classifiers of the supervised universal prediction module and the sample-specific novel prediction module, for each domain independently. That is, a regressive probability is assigned to an ORF depending on its distance from the separating hyperplane in the feature space of the SVM classifier. An ORF with a probability larger than the given threshold is regarded as protein-coding. Finally, a modified version of MetaTISA is used to relocate the TISs of all predicted genes to obtain high quality TIS annotations.
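One common way to turn a distance from the hyperplane into a probability is a sigmoid fit. The sketch below illustrates that idea only; it is not taken from the MetaGUN source, and the parameters `a` and `b` as well as the 0.5 threshold are hypothetical placeholder values that would normally be fitted or chosen on training data.

```python
import math

def decision_to_probability(decision_value, a=-1.0, b=0.0):
    """Map a signed distance from the SVM hyperplane to a probability.

    Uses the sigmoid P = 1 / (1 + exp(a * f + b)). The values of a and b
    are hypothetical defaults, not parameters from the article.
    """
    z = a * decision_value + b
    # Evaluate the sigmoid in a numerically stable way for large |z|.
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def is_coding(decision_value, threshold=0.5):
    """Call an ORF protein-coding if its probability exceeds the threshold."""
    return decision_to_probability(decision_value) > threshold
```

With these placeholder parameters, an ORF lying on the hyperplane gets probability 0.5, and ORFs further on the positive side approach 1.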
Since fragments in metagenomes can originate from diverse species, one of the major challenges is how to train statistical models that properly capture the features of sequences from different source genomes. Moreover, the short length of metagenomic fragments further complicates this problem. Most published gene finders for metagenomes incorporate a sequence classification procedure implicitly or explicitly. For example, MetaGene and MetaGeneMark train separate models for the two domains. Since they are based on the Markov model, input sequences are implicitly assigned during prediction to the domain whose model gives the higher score [12, 19].
We employ a k-mer method based on a naïve Bayesian classifier for sequence binning before gene prediction. The binning model is trained on the complete sequences of the selected 261 genomes by calculating the frequencies of k-mer oligonucleotides for each of them. For a given fragment s with a length of n bases, the probability of finding it in one of the 261 genomes can be calculated from its (n-k+1) overlapping oligonucleotides by using Bayesian classification. Then, the fragment s is regarded as originating from the genome with the highest posterior probability (for details see Additional file 1: Fragment classification strategy). This approach was successfully implemented in our previous work MetaTISA. To predict genes, we follow the strategy applied by MetaGene and MetaGeneMark of training separate gene prediction models for Archaea and Bacteria. Therefore, the fragments are also clustered into the two domains according to the phylogenetic relationships of the assigned genomes, and predicted by the corresponding gene prediction models independently.
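The binning idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MetaGUN implementation: it assumes a uniform prior over genomes, add-one smoothing for unseen k-mers, and a single strand; the function names are hypothetical.

```python
import math
from collections import Counter

def kmer_counts(seq, k):
    """Count the (n - k + 1) overlapping k-mers of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def train_model(genome, k, pseudocount=1.0):
    """Return (log k-mer probabilities, log probability of an unseen k-mer)."""
    counts = kmer_counts(genome, k)
    total = sum(counts.values()) + pseudocount * 4 ** k
    log_probs = {kmer: math.log((c + pseudocount) / total)
                 for kmer, c in counts.items()}
    return log_probs, math.log(pseudocount / total)

def classify(fragment, models, k):
    """Assign the fragment to the genome with the highest log posterior.

    With a uniform prior (an assumption here), ranking by posterior is
    the same as ranking by the naive Bayes log-likelihood.
    """
    best_name, best_score = None, -math.inf
    for name, (log_probs, unseen) in models.items():
        score = sum(log_probs.get(fragment[i:i + k], unseen)
                    for i in range(len(fragment) - k + 1))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

In use, one model would be trained per reference genome and each incoming fragment passed to `classify`; the domain (Archaea or Bacteria) then follows from the taxonomy of the winning genome.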
Feature selection for SVM
The support vector machine approach has been widely used in solving prediction problems in bioinformatics that can be represented in the form of a binary classification, such as gene identification, protein-protein interaction prediction and horizontally transferred gene detection [27–29]. It can learn more accurate classifiers for patterns that cannot be easily separated in the input space by transforming the input patterns into a feature space using a suitable kernel function (for details see Additional file 1: SVM algorithm in MetaGUN). Selecting relevant features for machine learning approaches is important for a number of reasons such as generalization performance, running efficiency and feature interpretation. The support vector machine method is no exception. In this work, we utilize three sets of statistics to describe the coding potential: the EDP description of codon usage, the TIS scores and the ORF length.
EDP description of codon usage
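The EDP of a sequence is the vector with components $s_i = -\frac{1}{H} p_i \log p_i$, where $p_i$ is the abundance of the $i$th codon, obtained by dividing the number of its occurrences in the sequence by the total number of codons, $i = 1, 2, \ldots, 61$ is the index of the 61 codons (excluding the 3 stop codons), and $H = -\sum_{j=1}^{61} p_j \log p_j$ is the Shannon entropy.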
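As an illustration, the codon-usage EDP of an ORF can be computed as follows. This is a sketch that assumes the usual EDP definition, in which the i-th component is -(1/H) p_i log p_i with H the Shannon entropy of the codon abundances; the handling of empty or single-codon input (returning zeros) is a choice made here, not something stated in the article.

```python
import math
from itertools import product

# The 61 sense codons: all 64 triplets minus the three stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [c for c in ("".join(p) for p in product("ACGT", repeat=3))
                if c not in STOP_CODONS]

def codon_edp(orf):
    """Return the 61-dimensional entropy density profile of an in-frame ORF."""
    counts = dict.fromkeys(SENSE_CODONS, 0)
    for i in range(0, len(orf) - len(orf) % 3, 3):
        codon = orf[i:i + 3]
        if codon in counts:  # skips stop codons and ambiguous bases
            counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(SENSE_CODONS)
    probs = [counts[c] / total for c in SENSE_CODONS]
    # Per-codon entropy terms -p log p, with 0 * log 0 taken as 0.
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in probs]
    entropy = sum(terms)
    if entropy == 0:
        # Only one codon type present: the profile is undefined, return zeros.
        return [0.0] * len(SENSE_CODONS)
    return [t / entropy for t in terms]
```

By construction the components are non-negative and sum to one whenever the entropy is positive, so ORFs of different lengths are described on the same scale.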
Translation initiation site scores
The length of ORFs
The ORF length is another useful feature that has been frequently used for discriminating protein-coding from non-coding ORFs [12, 14, 18, 31]. It is reported that the average length of genes in complete genomes is about 950 bp, which is much longer than that of random ORFs. In some current methods, a log-odds score or log-likelihood ratio is assigned to a given ORF according to the length distributions of protein-coding genes and non-coding ORFs trained on complete genomes [12, 14]. However, the difficulty in integrating the ORF length feature is that a large number of ORFs are incomplete because metagenomic fragments are short [12, 14]. This indicates that complete and incomplete ORFs should be treated separately. Since MetaGUN is built on the SVM, a machine learning approach, it is convenient to handle complete and incomplete ORFs by treating their lengths as two separate features. Hence, two values are assigned as ORF lengths, one for complete and the other for incomplete ORFs. For a specific ORF, the value of the corresponding type is set to the actual ORF length, while the other value is set to zero.
The composition patterns of sequences from archaeal and bacterial genomes have been reported to be different, and tests have shown that prediction scores degrade if models from the wrong domain are employed for scoring [12, 19]. Therefore, separate SVM classifiers for Archaea and Bacteria are trained on the corresponding training genomes to serve as gene prediction models in MetaGUN.
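The two-value length encoding described above can be written as a small helper. The function name and the tuple ordering are hypothetical, and any scaling applied before the values reach the SVM is not specified in the text.

```python
def orf_length_features(orf_length, is_complete):
    """Encode ORF length as (complete_length, incomplete_length).

    The slot matching the ORF type holds the actual length in bp;
    the other slot is zero.
    """
    if is_complete:
        return (float(orf_length), 0.0)
    return (0.0, float(orf_length))
```

For example, a complete 900 bp ORF is encoded as `(900.0, 0.0)` and an incomplete 300 bp ORF as `(0.0, 300.0)`.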
Gene prediction model training
To identify protein-coding genes, MetaGUN comprises two gene prediction modules, namely the universal module and the novel module. SVM classifiers of the universal gene prediction module are trained on complete genomes with the purpose of capturing the universal features of currently known genes. In this work, to build the universal prediction module, 261 species were selected from NCBI RefSeq database release 45 (the latest release at the time we started to design the MetaGUN algorithm) according to the 'one species per genus' rule. The selected 261 species cover a wide range of phylogenetic clades and GC content and were isolated from varied environmental conditions, so they can serve as good representatives of sequenced microbes. The number of sequenced complete microbial genomes is growing dramatically with the revolutionary development of sequencing technology; however, we have found that our method based on these training genomes performs well (see Results and discussion), which indicates that the selected training genomes do capture the universal features of currently known genomes. Moreover, many metagenomic sequencing projects aim to study unculturable microorganisms, whose complete genomic sequences are currently unavailable. In these studies, the discovery of new genes with novel functionality is one of the principal objectives. Methods have been developed for detecting novel genes based on searching for conserved domains against known databases [32, 33]. Domain-based searches have been reported to be more sensitive to target genes than sequence similarity based methods like BLASTP because conserved domains rather than whole sequences are compared [27, 34]. For instance, Bork et al. applied conserved domain analysis to RcaE proteins, and predicted 16 novel domain architectures that may have potential novel functionalities in habitats with little or no light.
In our work, in an effort to address the novel gene prediction issue, a sample specific novel prediction module based on domain searches is incorporated.
Universal prediction module
Gene prediction performance on simulated shotgun sequences.
Novel prediction module
To predict genes that might be difficult for the universal gene prediction module to recognize, the sample-specific novel module, based on domain searches, is incorporated into MetaGUN. Firstly, the extracted ORFs are translated into amino acid sequences and searched for conserved domains against the Conserved Domain Database (CDD). Those carrying detected domain motifs with significant e-values ($< 10^{-40}$) are treated as training data for genes. To obtain training instances of non-coding ORFs, we follow GISMO and implement the 'shadow' rule. That is, an ORF overlapping a targeted gene by more than 90 bp in another reading frame is regarded as a non-coding ORF. Then, the training data are clustered into the two phylogenetic groups of Archaea and Bacteria according to the fragment classification results, and are used as input feature vectors for training SVM classifiers for each domain independently. If the number of training items is larger than 1.6 M, a subset of 1.6 M is randomly sampled for training the SVM classifier, following the experience from the universal prediction module; otherwise, the whole training set is used.
The LibSVM package is employed in our work to train the SVM classifiers with a Gaussian kernel function for both the universal prediction module and the novel prediction module. In each training procedure, a grid search of the parameter space is first performed to find the most suitable Gaussian kernel parameter γ and SVM parameter C (for details see Additional file 1: SVM algorithm in MetaGUN). Then all items in the training set of both the protein-coding and non-coding classes are implicitly mapped from the input space to the feature space that is determined by the Gaussian kernel under the learned best γ and C. Finally, a hyperplane (the SVM classifier) is learned by the SVM training program that optimally separates all training protein-coding and non-coding items.
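The 'shadow' rule and the size cap on the training set can be sketched as follows. This is an illustration under assumptions: ORFs are represented by a hypothetical `Orf` tuple with 0-based half-open coordinates on the fragment, a reading frame is identified by strand and codon phase, and the random seed is arbitrary.

```python
import random
from typing import NamedTuple

class Orf(NamedTuple):
    start: int   # 0-based inclusive start on the fragment
    end: int     # 0-based exclusive end on the fragment
    strand: str  # "+" or "-"

def reading_frame(orf):
    """Identify one of the six frames by strand and codon phase."""
    phase = orf.start % 3 if orf.strand == "+" else orf.end % 3
    return (orf.strand, phase)

def overlap(a, b):
    """Number of fragment positions shared by two ORFs."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))

def shadow_orfs(candidates, genes, min_overlap=90):
    """ORFs overlapping a gene by more than min_overlap bp in another frame."""
    return [o for o in candidates
            if any(reading_frame(o) != reading_frame(g)
                   and overlap(o, g) > min_overlap for g in genes)]

def subsample(items, max_items=1_600_000, seed=0):
    """Randomly keep max_items training items if there are more than that."""
    if len(items) <= max_items:
        return list(items)
    return random.Random(seed).sample(items, max_items)
```

Here `genes` stands for the ORFs with significant domain hits, and the list returned by `shadow_orfs` supplies the non-coding training instances.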
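For reference, the Gaussian kernel and the shape of a (C, γ) grid can be written out directly. This is only a sketch: the exponent ranges below are an assumption chosen to resemble commonly used log2 grids, not values reported in the article, and each candidate pair would still need to be scored (for example by cross-validation) with an actual SVM trainer.

```python
import math
from itertools import product

def gaussian_kernel(x, y, gamma):
    """Gaussian (RBF) kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def parameter_grid(log2_c=range(-5, 16, 2), log2_gamma=range(-15, 4, 2)):
    """Enumerate candidate (C, gamma) pairs on a log2 grid.

    The default exponent ranges are hypothetical illustration values.
    """
    return [(2.0 ** c, 2.0 ** g) for c, g in product(log2_c, log2_gamma)]
```

The kernel equals 1 for identical feature vectors and decays toward 0 as they move apart, with γ controlling how quickly.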
Translation initiation site prediction
Accurate prediction of gene starts is also a very important issue in metagenomic sequencing projects, as it is indispensable for experimental characterization of novel genes; however, it has not been studied much in the literature [13, 21]. TIS prediction for complete genomes has a long history and a number of tools have been developed [24, 36–41]. The difficulty of TIS prediction in prokaryotic genomes lies in the divergence of the regulatory signals, which indicates divergent translation initiation mechanisms. Studies have revealed that upstream of the TISs there are SD motifs for leadered genes and non-SD signals for leaderless genes [41–43]. However, the short and anonymous nature of metagenomic fragments presents further challenges.
In one of our previous works, MetaTISA was built to address this problem and greatly improved the TIS annotations of MetaGeneAnnotator. Recently, two works have paid special attention to TIS prediction and have achieved substantial progress [13, 14]. For example, MetaProdigal follows the same strategy as Prodigal, its version for isolated genomes, and uses a TIS scoring system that integrates default scoring bins based on prior RBS motifs and rigorous searches for alternative motifs if no SD motif appears. The same work also reported that the published MetaTISA tends to shift predicted starts to downstream start codons for genes whose true TISs are close to or run off the edge of the fragments.
Results and discussion
Due to the lack of experimentally characterized genes and translation initiation sites in metagenomic sequencing projects, the performance of current methods has always been evaluated on simulated fragments [12–14, 18–21]. However, two significant drawbacks of this methodology should be noted. Firstly, most annotated genes in the NCBI RefSeq and GenBank databases have not been verified by experiments. Annotation errors have been reported in some species, especially for genomes with high GC content [44, 45]. So, in recent studies of metagenomic gene finders, annotated hypothetical genes are removed from the benchmarks for reliable assessment [13, 14, 19]. Secondly, the reliability of TIS annotations in public databases is also questionable. A large-scale computational evaluation reported that RefSeq's TIS annotations are biased to over-annotate the leftmost start codons and under-annotate ATG start codons. Here, in the performance comparison of gene prediction, we follow MetaGene and Orphelia and choose the 12 genomes, which have good coverage of Archaea and Bacteria as well as varied levels of GC content. Considering the mentioned problems in RefSeq annotations, we follow the same strategy as MetaGeneMark and discard the fragments containing any annotated hypothetical genes. Moreover, TIS prediction accuracy is not evaluated on these genomes because of the unreliability of their TIS annotations. Instead, we use the 6 genomes for which experimentally characterized gene starts are available for TIS prediction assessment.
Gene prediction performance on artificial shotgun sequences
We compare the prediction performance of MetaGUN on the 3' end of genes with 6 current metagenomic gene finders in this section. Artificial shotgun fragments with 3x coverage are simulated for each of the 12 testing genomes. To represent sequences produced by different sequencing technologies, three kinds of simulation are created with different sequence lengths (870 bp, 535 bp and 120 bp) according to the settings in Glimmer-MG. In addition, fragments with a length of 1200 bp are also simulated in order to investigate the performance on assembled contigs of larger size. Predictions with exactly matched 3' ends, or with a matched reading frame if the 3' end is missing, are regarded as correctly predicted genes, that is, true positives. The sensitivity (Sn) and the specificity (Sp) are defined as the fraction of true positives among all annotated genes and among all predicted genes, respectively. We also use the harmonic mean as a composite measure of sensitivity and specificity, defined as 2 Sn Sp/(Sn + Sp). Note that unlike the comparisons in Glimmer-MG, simulated fragments overlapping annotated hypothetical genes are excluded from the testing sets in this work; hence the benchmarks are complete and the measures of sensitivity and specificity are both meaningful.
The predictions of other methods are obtained by running them locally. FragGeneScan is run with the 'complete' model parameter trained for error-free sequences, and Orphelia is run with both the 'Net700' and 'Net300' models, with the better result chosen for comparison. The others are run with default settings. For a comprehensive investigation, we run two versions of MetaGUN: one is trained on all 261 training genomes and is denoted as 'MGC' in Table 1; the other is trained on the genomes excluding the 12 testing genomes and is denoted as 'MG'. The comparisons with other methods are based on the 'MG' version. In addition, since most metagenomic gene finders overlook genes shorter than 60 bp, we only evaluate genes longer than that.
The accuracies are shown in Table 1. For longer fragments, that is 1200 bp, 870 bp and 535 bp, MetaGUN outperforms other gene finders in harmonic mean, with values over 96%. For shorter fragments of 120 bp, performance falls severely for all methods, especially Orphelia. This illustrates that one of the challenges in predicting genes on short sequences is the uninformative incomplete ORFs. At this length, MetaGUN and Glimmer-MG achieve comparable performance with more than 91% in harmonic mean, which is much better than other methods. It is worth noting that MetaGUN always achieves the best specificity across all simulations with different fragment lengths, which means its predictions are the most reliable. The Orphelia method, the other one based on a machine learning approach, also exhibits good specificity on longer fragments. However, its sensitivities are usually lower than those of the others. The comparison of the results on 3' ends indicates that MetaGUN makes better predictions than existing algorithms for the longer fragments produced by the Sanger and Roche 454 sequencing platforms, as well as for longer contigs after assembly. Although its performance is not superior to Glimmer-MG on the shorter fragments corresponding to the Illumina sequencing platform, it is still much better than the others. Moreover, with the aid of deep sequencing and effective assembly, contigs will get longer. In a recent study on the human gut microbiome with deep sequencing, Qin et al. reported that as much as 42.7% of the Illumina GA reads were assembled into contigs longer than 500 bp, with an N50 length of 2.2 kb. Meanwhile, sequencing technologies are developing to produce longer reads, on which MetaGUN can perform better than others.
A practical problem with metagenomic fragments is sequencing errors. The error rates of raw data are reported to range from 0.001% to 1% for Sanger sequencing, and from 0.5% to 2.8% for pyrosequencing. Prior work has shown that sequencing errors, especially frame shifts, have a severe impact on gene prediction. Two of the previously mentioned metagenomic gene finders, FragGeneScan and Glimmer-MG, have specially designed models to address this issue and have achieved better accuracies than other methods when running on error-prone fragments [14, 20]. However, in this work, we concentrate on predicting genes on error-free fragments for the following reasons. Firstly, most low-quality nucleotides are located around the ends of the reads, and can be removed by quality trimming and vector screening, or corrected by sequence assembly. Secondly, separate software has been developed for identifying frame shifts in metagenomic fragments. It can be run prior to gene prediction to reduce the influence of sequencing errors. Moreover, frame shifts are likely to decrease greatly with the aid of deeper sequencing, effective assembly and future improvements in sequencing technologies.
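The three measures defined above can be computed from simple counts. The function below is a small sketch; its name and the convention of returning 0.0 when a denominator is zero are choices made here for illustration.

```python
def gene_prediction_metrics(true_positives, annotated_genes, predicted_genes):
    """Return (Sn, Sp, harmonic mean) from gene counts.

    Sn is the fraction of annotated genes that were correctly predicted;
    Sp, as defined in the text, is the fraction of predictions that are correct.
    """
    sn = true_positives / annotated_genes if annotated_genes else 0.0
    sp = true_positives / predicted_genes if predicted_genes else 0.0
    harmonic_mean = 2 * sn * sp / (sn + sp) if (sn + sp) > 0 else 0.0
    return sn, sp, harmonic_mean
```

For instance, 90 correct predictions out of 100 annotated genes and 120 predicted genes give Sn = 0.9 and Sp = 0.75, and the harmonic mean lies between the two, closer to the smaller value.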
TIS prediction performance on experimental data
Since many environmental sequencing projects aim at studying gene functions by experimental characterization, accurate prediction of TISs is very important, because correct TISs are indispensable for expressing genes [18, 21]. To investigate the TIS prediction performance, we implement almost the same strategy applied in MetaTISA with two adjustments. Firstly, we follow Hyatt et al. and assess the TIS accuracy on both internal and external TISs. An internal TIS is a TIS located inside a fragment, and an external TIS is one that lies beyond the edge of a fragment. Secondly, the simulated fragment lengths are 870 bp and 535 bp. Shorter fragments are not considered in the TIS assessment because they are so short that the true TIS lies beyond the fragment in most cases.
TIS prediction performance on experimentally characterized gene starts.
Application to human gut microbiome
Application to 2 human gut microbiome samples.
It is widely accepted that microorganisms in the human gut microbiome can contribute certain vitamins to the host. We have found an interesting case that can provide a clue. A domain named cobN is detected, which usually exists in cobN genes that are involved in cobalt transport or B12 biosynthesis in a number of species such as actinobacteria, cyanobacteria, betaproteobacteria and pseudomonads. Moreover, domains involved in short-chain dehydrogenase are also detected in some genes; these are reported to be used by gut bacteria for fermentation to generate energy and for converting sugars. Similar to the analysis of the phylogenetic distribution of genes on the IMG/M website, domains originating from Eukaryotes and Viruses are also detected, such as ATG13 (from Autophagy-related protein 13), danK (from heat shock protein) and PAT1 (from Topoisomerase II-associated protein).
In this article, we present a novel method for identifying genes in metagenomic fragments. It predicts genes in three steps: first classifying input sequences into different phylogenetic groups, then identifying genes for each group independently with both the universal prediction module and the novel prediction module, and finally relocating TISs with a modified version of MetaTISA. We compared the prediction results with 6 current metagenomic gene finders. On the 3' end of genes, MetaGUN is better than other methods on longer fragments, and on shorter fragments it is comparable with Glimmer-MG, both being much better than the others. A notable advantage is that MetaGUN always makes the most reliable predictions. On the 5' end of genes, MetaGUN outperforms others on the overall TISs and in particular predicts many more correct internal TISs. The application to 2 samples from the human gut microbiome also shows that MetaGUN predicts more reliable results. Furthermore, we have attempted to investigate the novel gene discovery ability on these 2 real samples. With the effective integration of the novel prediction module, MetaGUN can find more potential novel genes than others. Detailed analysis of the discovered potential novel genes shows that there exist a number of biologically meaningful cases. Overall, MetaGUN makes substantial advances in gene prediction for metagenomic fragments with three notable contributions: improvements for both the protein-coding sequences and the translation initiation sites, and a greater ability for novel gene discovery. We believe that MetaGUN will serve as a useful tool for both bioinformatics and experimental research.
We wish to thank Prof. Chunting Zhang of Tianjin University and Prof. Xuegong Zhang of Tsinghua University for their interest in the project and useful discussions. We also thank Dr. Xiaobin Zheng, Binbin Lai, Longshu Yang, Luying Liu, Qi Wang and Xiaoqi Wang for their help with the work.
Publication of this article was supported by the National Key Technology Research and Design Program of China (2012BAI06B02), National Natural Science Foundation of China (30970667, 11021463, 61131003 and 91231119), National Basic Research Program of China (2011CB707500), and Excellent Doctoral Dissertation Supervisor Funding of Beijing (YB20101000102).
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.
- Pruitt KD, Tatusova T, Klimke W, Maglott DR: NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 2009, 37 (Database issue): D32-D36.
- Hugenholtz P: Exploring prokaryotic diversity in the genomic era. Genome Biol. 2002, 3 (2): REVIEWS0003.
- Rappe MS, Giovannoni SJ: The uncultured microbial majority. Annu Rev Microbiol. 2003, 57: 369-394. 10.1146/annurev.micro.57.030502.090759.
- Wooley JC, Godzik A, Friedberg I: A primer on metagenomics. PLoS Comput Biol. 2010, 6 (2): e1000667. 10.1371/journal.pcbi.1000667.
- Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P: A bioinformatician's guide to metagenomics. Microbiol Mol Biol Rev. 2008, 72 (4): 557-78. 10.1128/MMBR.00009-08.
- Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004, 428 (6978): 37-43. 10.1038/nature02340.
- Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO: Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004, 304 (5667): 66-74. 10.1126/science.1093857.
- Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM: Comparative metagenomics of microbial communities. Science. 2005, 308 (5721): 554-557. 10.1126/science.1107851.
- Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE: Metagenomic analysis of the human distal gut microbiome. Science. 2006, 312 (5778): 1355-1359. 10.1126/science.1124234.
- Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, Takami H, Morita H, Sharma VK, Srivastava TP, Taylor TD, Noguchi H, Mori H, Ogura Y, Ehrlich DS, Itoh K, Takagi T, Sakaki Y, Hayashi T, Hattori M: Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007, 14 (4): 169-181. 10.1093/dnares/dsm018.
- Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, Mende DR, Li J, Xu J, Li S, Li D, Cao J, Wang B, Liang H, Zheng H, Xie Y, Tap J, Lepage P, Bertalan M, Batto JM, Hansen T, Paslier DL, Linneberg A, Nielsen HB, Pelletier E, Renault P, Sicheritz-Ponten T, Turner K, Zhu H, Yu C, Li S, Jian M, Zhou Y, Li Y, Zhang X, Li S, Qin N, Yang H, Wang J, Brunak S, Doré J, Guarner F, Kristiansen K, Pedersen O, Parkhill J, Weissenbach J, Consortium MIT, Bork P, Ehrlich SD, Wang J: A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010, 464 (7285): 59-65. 10.1038/nature08821.
- Noguchi H, Park J, Takagi T: MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006, 34 (19): 5623-5630. 10.1093/nar/gkl723.
- Hyatt D, Locascio PF, Hauser LJ, Uberbacher EC: Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics. 2012, 28 (17): 2223-2230. 10.1093/bioinformatics/bts429.
- Kelley DR, Liu B, Delcher AL, Pop M, Salzberg SL: Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Res. 2012, 40: e9. 10.1093/nar/gkr1067.
- Badger JH, Olsen GJ: CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol. 1999, 16 (4): 512-524. 10.1093/oxfordjournals.molbev.a026133.
- Frishman D, Mironov A, Mewes HW, Gelfand M: Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res. 1998, 26 (12): 2941-2947. 10.1093/nar/26.12.2941.
- Noguchi H, Taniguchi T, Itoh T: MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res. 2008, 15 (6): 387-396. 10.1093/dnares/dsn027.
- Hoff KJ, Tech M, Lingner T, Daniel R, Morgenstern B, Meinicke P: Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics. 2008, 9: 217. 10.1186/1471-2105-9-217.
- Zhu W, Lomsadze A, Borodovsky M: Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 2010, 38 (12): e132. 10.1093/nar/gkq275.
- Rho M, Tang H, Ye Y: FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010, 38 (20): e191. 10.1093/nar/gkq747.
- Hu GQ, Guo JT, Liu YC, Zhu H: MetaTISA: Metagenomic Translation Initiation Site Annotator for improving gene start prediction. Bioinformatics. 2009, 25 (14): 1843-1845. 10.1093/bioinformatics/btp272.
- Ouyang Z, Zhu H, Wang J, She ZS: Multivariate entropy distance method for prokaryotic gene identification. J Bioinform Comput Biol. 2004, 2 (2): 353-373. 10.1142/S0219720004000624.
- Zhu H, Hu GQ, Yang YF, Wang J, She ZS: MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes. BMC Bioinformatics. 2007, 8: 97. 10.1186/1471-2105-8-97.
- Hu GQ, Zheng XB, Zhu HQ, She ZS: Prediction of translation initiation site for microbial genomes with TriTISA. Bioinformatics. 2009, 25: 123-125. 10.1093/bioinformatics/btn576.
- Chang CC, Lin CJ: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2: 27:1-27:27.
- Sandberg R, Winberg G, Bränden CI, Kaske A, Ernberg I, Cöster J: Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier. Genome Res. 2001, 11 (8): 1404-1409. 10.1101/gr.186401.
- Krause L, McHardy AC, Nattkemper TW, Puhler A, Stoye J, Meyer F: GISMO-gene identification using a support vector machine for ORF classification. Nucleic Acids Res. 2007, 35 (2): 540-549.
- Guo Y, Yu L, Wen Z, Li M: Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 2008, 36 (9): 3025-3030. 10.1093/nar/gkn159.
- Tsirigos A, Rigoutsos I: A sensitive, support-vector-machine method for the detection of horizontal gene transfers in viral, archaeal and bacterial genomes. Nucleic Acids Res. 2005, 33 (12): 3699-3707. 10.1093/nar/gki660.
- Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 1999, 27 (23): 4636-4641. 10.1093/nar/27.23.4636.
- Larsen TS, Krogh A: EasyGene-a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics. 2003, 4: 21. 10.1186/1471-2105-4-21.
- Singh AH, Doerks T, Letunic I, Raes J, Bork P: Discovering functional novelty in metagenomes: examples from light-mediated processes. J Bacteriol. 2009, 191: 32-41. 10.1128/JB.01084-08.
- Krause L, Diaz NN, Bartels D, Edwards RA, Puhler A, Rohwer F, Meyer F, Stoye J: Finding novel genes in bacterial communities isolated from the environment. Bioinformatics. 2006, 22 (14): e281-e289. 10.1093/bioinformatics/btl247.
- Harrington ED, Singh AH, Doerks T, Letunic I, von Mering C, Jensen LJ, Raes J, Bork P: Quantitative assessment of protein function prediction from metagenomics shotgun sequences. Proc Natl Acad Sci USA. 2007, 104 (35): 13913-13918. 10.1073/pnas.0702636104.
- Richter DC, Ott F, Auch AF, Schmid R, Huson DH: MetaSim: a sequencing simulator for genomics and metagenomics. PLoS One. 2008, 3 (10): e3373. 10.1371/journal.pone.0003373.
- Besemer J, Lomsadze A, Borodovsky M: GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 2001, 29 (12): 2607-2618. 10.1093/nar/29.12.2607.
- Zhu H, Hu GQ, Ouyang ZQ, Wang J, She ZS: Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics. 2004, 20 (18): 3308-3317. 10.1093/bioinformatics/bth390.
- Tech M, Pfeifer N, Morgenstern B, Meinicke P: TICO: a tool for improving predictions of prokaryotic translation initiation sites. Bioinformatics. 2005, 21 (17): 3568-3569. 10.1093/bioinformatics/bti563.
- Makita Y, de Hoon MJL, Danchin A: Hon-yaku: a biology-driven Bayesian methodology for identifying translation initiation sites in prokaryotes. BMC Bioinformatics. 2007, 8: 47. 10.1186/1471-2105-8-47.
- Delcher AL, Bratke KA, Powers EC, Salzberg SL: Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007, 23 (6): 673-679. 10.1093/bioinformatics/btm009.
- Hu GQ, Zheng X, Yang YF, Ortet P, She ZS, Zhu H: ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genomes. Nucleic Acids Res. 2008, 36 (Database issue): D114-D119.
- Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ: Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010, 11: 119. 10.1186/1471-2105-11-119.
- Zheng XB, Hu GQ, She ZS, Zhu H: Leaderless genes in bacteria: clue to the evolution of translation initiation mechanisms in prokaryotes. BMC Genomics. 2011, 12: 361. 10.1186/1471-2164-12-361.
- Luo C, Hu GQ, Zhu H: Genome reannotation of Escherichia coli CFT073 with new insights into virulence. BMC Genomics. 2009, 10: 552. 10.1186/1471-2164-10-552.
- Angelova M, Kalajdziski S, Kocarev L: Computational Methods for Gene Finding in Prokaryotes. ICT Innovations. 2010, 11-20.
- Hu GQ, Zheng X, Ju LN, Zhu H, She ZS: Computational evaluation of TIS annotation for prokaryotic genomes. BMC Bioinformatics. 2008, 9: 160. 10.1186/1471-2105-9-160.
- Hoff KJ: The effect of sequencing errors on metagenomic gene prediction. BMC Genomics. 2009, 10: 520. 10.1186/1471-2164-10-520.
- Antonov I, Borodovsky M: Genetack: frameshift identification in protein-coding sequences by the Viterbi algorithm. J Bioinform Comput Biol. 2010, 8 (3): 535-551. 10.1142/S0219720010004847.
- Marchler-Bauer A, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Marchler GH, Mullokandov M, Song JS, Tasneem A, Thanki N, Yamashita RA, Zhang D, Zhang N, Bryant SH: CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 2009, 37 (Database issue): D205-D210.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.