Volume 14 Supplement 9
MGC: a metagenomic gene caller
© El Allali and Rose; licensee BioMed Central Ltd. 2013
Published: 28 June 2013
Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract complete and incomplete open reading frames (ORFs) directly from short reads and identify the coding ORFs, bypassing other challenging tasks such as the assembly of the metagenome.
In this paper we introduce a metagenomics gene caller (MGC) which is an improvement over the state-of-the-art prediction algorithm Orphelia. Orphelia uses a two-stage machine learning approach and computes a model that classifies extracted ORFs from fragmented sequences. We hypothesise and demonstrate evidence that sequences need separate models based on their local GC-content in order to avoid the noise introduced to a single model computed with sequences from the entire GC spectrum. We have also added two amino-acid features based on the benefit of amino-acid usage shown in our previous research. Our algorithm is able to predict genes and translation initiation sites (TIS) more accurately than Orphelia which uses a single model.
Learning separate models for several pre-defined GC-content regions as opposed to a single model approach improves the performance of the neural network as demonstrated by the experimental results presented in this paper. The inclusion of amino-acid usage features also helps improve the overall accuracy of our algorithm. MGC's improvement sets the ground for further investigation into the use of GC-content to separate data for training models in machine learning based gene finders.
In cultured microbes, the shotgun sequences that result from sequencing the full genome come from a single clone, which makes the assembly and annotation of the genome manageable. In metagenomics, the uncultured microbes are sampled directly from their environment. Next generation sequencing (NGS) used in metagenomics results in a much larger amount of data than traditional sequencing. However, the resulting sequences are noisy, partial and, most importantly, may come from thousands of different species. Therefore, the assembly and annotation of large metagenomic datasets present more challenges. Several methods have shown promising results and efficiency in assembling short-read data [3, 4]. However, these methods are designed for single genomes. Consequently, they do not work well when multiple species are present, as is the case in environmental samples. One way to deal with these difficulties is to bypass assembly and go directly to finding genes.
New methods are being developed to predict genes specifically in metagenomics. The best known methods in this field are MetaGene, Orphelia, and FragGeneScan. MetaGene uses an approach similar to GeneMark.hmm, which takes into account GC-content sensitive monocodon and dicodon models computed from fully annotated genomes. Once MetaGene extracts all the possible open reading frames (ORFs) present in the fragments, it uses statistical models computed from fully annotated genomes to score them. The next step uses a dynamic programming algorithm that combines the previous score with the ORF length, the distance between the ORF and its neighbor, and the distance between the translation initiation site (TIS) and the left-most start codon. The goal of the dynamic programming algorithm is to select the final set of ORFs by resolving the overlap between ORFs. The scoring system is based on the log-odds ratios of the observed frequency in coding ORFs and the observed frequency in random ORFs. Two models are used by MetaGene, one for bacteria and one for archaea. These are automatically selected based on the outcome of a pre-defined domain classification method during the classification. MetaGene has been tested on randomly sampled fragments of size 700 bp from 12 annotated whole genomes. The results show the ability of MetaGene to predict genes with high sensitivity and slightly lower specificity. Orphelia obtains better performance than MetaGene by using a two-stage machine learning approach. The first stage builds linear discriminants for monocodon and dicodon usage as well as for the TIS features extracted from the ORFs. This step linearly projects the high dimensional features obtained from the codon usage and the TIS information, reducing each usage to a single feature.
The next stage combines the features obtained from the linear discriminants with length and GC-content features using a non-linear neural network, which produces the probability that a given ORF encodes a protein. Finally, Orphelia deploys a post-processing algorithm which uses probabilities from its scoring scheme in order to resolve the overlap. Orphelia is tested in a similar way to MetaGene; however, more extensive experiments have been conducted, including studies of the effect of different fragment lengths, the accuracy of the program in predicting the TIS, and its ability to predict complete versus incomplete genes.
FragGeneScan is an algorithm based on hidden Markov models (HMM) capable of predicting genes in both complete genomes and metagenomic fragments. The algorithm combines codon usage, sequence patterns for start/stop codons and sequencing error models using HMMs. The Viterbi algorithm is used to decide the best path of hidden states that generates the observed nucleotide fragment. The accuracy of FragGeneScan in short reads was compared to that of MetaGene. For simulated 700 bp reads with no sequencing error, FragGeneScan and MetaGene achieve comparable performance. However, for shorter reads and reads with sequencing errors, FragGeneScan shows consistently better performance than MetaGene.
In this paper we introduce a new metagenomics gene caller called MGC which is based on a two-stage machine learning approach similar to that of the state-of-the-art program Orphelia. MGC learns separate models for several pre-defined GC ranges, as opposed to the single model approach used by Orphelia, and applies the appropriate model to each fragment based on its GC-content. Chan and Stolfo investigated model combination for machine learning classification and showed that models learned from disjoint partitions of a dataset outperform a single model learned from the entire dataset. Separating the training data by GC-content provides MGC with mutually exclusive partitions of the data in order to train multiple models.
We use GC-content to partition the training dataset for our two-stage machine learning approach. The use of GC-content for this purpose is inspired by the causal relationship between nucleotide bias and amino acid composition. Singer and Hickey demonstrated that nucleotide bias can have a dramatic effect on the amino acid composition of the encoded proteins: GC-poor genomes have proteins that are rich in the FYMINK amino acids, and GC-rich genomes have proteins that are rich in the GARP amino acids. This effect is not only present in complete genomes but is also valid for individual genes. Singer and Hickey identified genes common to a GC-poor genome (B. burgdorferi) and a GC-rich genome (M. tuberculosis) and measured the synonymous nucleotide frequencies and amino acid contents of each gene. While there was no overlap in the synonymous GC-contents of these two genomes, some overlap in the amino acid proportions of the encoded proteins exists. However, no overlap in the amino acid proportions of the encoded proteins in the common genes was found: the GARP/FYMINK ratio in the M. tuberculosis homolog was higher than the ratio of the corresponding gene in B. burgdorferi. Separating the models by GC-content can ensure that both compositions are accounted for instead of combining them into one model.
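The GARP/FYMINK ratio discussed above can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `garp_fymink_ratio` is hypothetical, and the ratio is assumed to be a plain count of G, A, R and P residues divided by a plain count of F, Y, M, I, N and K residues in a one-letter protein sequence.

```python
def garp_fymink_ratio(protein: str) -> float:
    """Return count(G, A, R, P) / count(F, Y, M, I, N, K) for a protein.

    Assumption: the ratio is computed from raw residue counts of a
    one-letter amino acid sequence; this helper is illustrative only.
    """
    seq = protein.upper()
    garp = sum(seq.count(aa) for aa in "GARP")
    fymink = sum(seq.count(aa) for aa in "FYMINK")
    if fymink == 0:
        raise ValueError("no FYMINK residues; ratio is undefined")
    return garp / fymink
```

Under this definition, a protein from a GC-rich genome would be expected to give a higher value than its homolog from a GC-poor genome.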
GC-content influences codon usage, which in turn influences amino acid usage. Lightfield et al. have shown that across bacterial phyla, distantly-related genomes with similar genomic GC-content have similar patterns of amino acid usage. They examined codon usage patterns and were able to predict protein amino acid content as a function of genomic GC-content. Lightfield et al. demonstrated that the use of amino acids encoded by GC-rich codons increased by approximately 1% for each 10% increase in genomic GC-content, while the use of amino acids encoded by GC-poor codons decreased correspondingly. Separating GC-contents into several GC ranges will ensure that the different linear discriminants can separate the codon and amino acid usage more precisely.
Another effect of GC-content is its link to the length of the genes. GC-rich genes in prokaryotes tend to be the longest while GC-poor genes tend to be the shortest. The longer the gene is, the more candidate TIS codons the ribosome encounters. Unlike the ribosome, models find it hard to pick the correct TIS from a large number of candidates, especially when they are close to each other. In addition to the number of candidate TIS codons, these candidates share most of the TIS window used to compute the features. Having separate models for genes that have a large number of start codons will ensure that the subtle difference between the candidates is learned by the non-linear neural networks.
In addition to separating the models by GC-content, MGC uses two amino acid features motivated by the benefit that these features have demonstrated in our previous research. The use of amino acid composition as a protein feature is an early discovery. Amino acid bias has been used in several identification problems such as gene expression, protein identification, family classification and protein secondary structure prediction. For example, Misawa and Kikuno have found that the effect of amino acid composition on gene expression is stronger than that of the codon composition. In a survey of codon and amino acid frequency bias in microbial genomes, Merkl found that optimizing translational efficiency has an effect on biased amino acid composition. If a cell requires certain proteins in large quantities, then the amino acids that consume less energy during translation appear more frequently. This bias is not adequately represented by GC-content or codon usage. We hypothesise that amino acid usage provides our models with species-specific differences caused by protein synthesis energy constraints.
We use the same two datasets used by Orphelia, one for training the neural network models and a second one for testing the MGC algorithm. The first dataset consists of 131 fully sequenced bacterial and archaeal genomes and their corresponding gene annotations obtained from GenBank, and the second dataset comprises ten bacterial and three archaeal genomes. Hoff et al. list all the genomes used for training the Orphelia neural network in the supplementary materials of their paper and all genomes used for testing in Table 1 of their publication. The n-fold coverage of a genome is defined as an amount of sampled DNA whose total length equals n times the length of the complete genome sequence. Fragments of 700 bp are randomly excised to create a 1-fold genome coverage for each genome in the training dataset and a 5-fold coverage for each genome in the testing dataset.
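As a rough illustration of the n-fold coverage definition, the sketch below excises random fixed-length fragments until their total length is approximately n times the genome length. The function name `sample_fragments`, the rounding of the fragment count, the restriction to the forward strand and the treatment of the genome as a linear string are all assumptions made for the example, not details taken from the sampling procedure used here.

```python
import random


def sample_fragments(genome: str, fold: float, frag_len: int = 700, seed: int = 0):
    """Randomly excise fragments to reach roughly `fold`-fold coverage.

    Assumptions (illustrative only): the genome is a linear string,
    only the forward strand is sampled, and the number of fragments is
    fold * genome_length / frag_len rounded to the nearest integer.
    """
    last_start = len(genome) - frag_len
    if last_start < 0:
        raise ValueError("genome is shorter than the fragment length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n_frags = round(fold * len(genome) / frag_len)
    fragments = []
    for _ in range(n_frags):
        start = rng.randint(0, last_start)  # inclusive on both ends
        fragments.append(genome[start:start + frag_len])
    return fragments
```

For a 14,000 bp toy genome, `sample_fragments(genome, 1)` yields 20 fragments of 700 bp and `sample_fragments(genome, 5)` yields 100.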
Two additional training datasets (different from the neural network training data) are used for the preprocessing step required in feature extraction. The first dataset is used for preprocessing the codon usage as described in the next section. Sequences are randomly sampled to create a 0.5-fold genome coverage. Annotated genes serve as positive examples (≈ 1.9 × 10^5 examples) and the longest ORF in each non-coding ORF-set serves as a negative example (≈ 2.8 × 10^6 examples). The second dataset is used to preprocess the TIS feature, which will be described in the next section. Symmetric windows of 60 bp around the TIS of the previously selected genes in the first dataset serve as positive examples (≈ 1.9 × 10^5 examples) while similar windows around the remaining start codons forming the ORF-set of each gene serve as negative examples (≈ 5.6 × 10^6 examples).
In order to train the neural network, ORFs are extracted from the neural network training dataset and divided into positive and negative examples. ORFs from annotated genes serve as the positive examples (≈ 2.6 × 10^6 examples) while one randomly selected ORF out of each non-coding ORF-set makes up the negative examples (≈ 4.5 × 10^6 examples).
The MGC algorithm
MGC is a metagenomic gene caller based on a two-stage machine learning approach similar to that of the state-of-the-art program Orphelia. The first stage consists of linear discriminants that reduce a high dimensional feature space into a smaller one. For example, the linear discriminant for the dicodon usage reduces the 4096 dicodon frequencies into a single feature. However, these features are not linear across the entire GC spectrum. GC-content has a direct effect on codon and amino acid usages, which means that fragments with similar GC-content should have similar features. Therefore, building different linear discriminants for each GC range will result in a better linear combination of the feature space which will better characterize the coding class.
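To make the dimensionality reduction concrete, the following sketch builds the 4096-dimensional dicodon frequency vector of an ORF and projects it onto a weight vector to obtain a single feature. The names `dicodon_frequencies` and `project` are hypothetical, and counting in-frame codon pairs with a step of one codon is an assumption of this example; the weight vector would come from a trained linear discriminant.

```python
from itertools import product

# All 64 codons and all 64 * 64 = 4096 ordered codon pairs (dicodons).
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
DICODON_INDEX = {a + b: i for i, (a, b) in enumerate(product(CODONS, repeat=2))}


def dicodon_frequencies(orf: str) -> list:
    """Relative frequency of each in-frame dicodon in an ORF.

    Assumption: dicodons are counted at every codon boundary (step of
    3 nt), and windows containing non-ACGT symbols are skipped.
    """
    seq = orf.upper()
    counts = [0] * len(DICODON_INDEX)
    total = 0
    for pos in range(0, len(seq) - 5, 3):
        idx = DICODON_INDEX.get(seq[pos:pos + 6])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def project(freqs, weights) -> float:
    """Reduce a frequency vector to one feature via a dot product."""
    return sum(f * w for f, w in zip(freqs, weights))
```

With separate models per GC range, one such weight vector would be learned for each range rather than a single one for the whole GC spectrum.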
For each GC range we obtain a model using features computed from all the sequences in the training dataset that have GC-content within the GC range. The same GC ranges used to compute the linear discriminants are used to build the neural network models. Different partitionings by GC-content are used to study the effect of the GC range size on the performance of MGC. In this paper we investigate the outcome of MGC models trained by partitioning the training data into 10%, 5% and 2.5% ranges. For the remainder of this paper, we refer to these ranges as the 10%, 5% and 2.5% GC ranges.
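A minimal sketch of how a fragment could be assigned to a GC range is given below. The helper names `gc_content` and `gc_range_index` are hypothetical, and the boundary handling (left-closed ranges starting at 0%, with 100% falling into the last range) is an assumption of this example.

```python
def gc_content(seq: str) -> float:
    """GC-content of a sequence in percent, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / acgt


def gc_range_index(gc_percent: float, width: float = 10.0) -> int:
    """Zero-based index of the GC range that contains `gc_percent`.

    Assumption: ranges are [0, width), [width, 2 * width), ... and a
    GC-content of exactly 100% is assigned to the last range.
    """
    n_ranges = int(round(100.0 / width))
    return min(int(gc_percent // width), n_ranges - 1)
```

With `width=10.0` this gives 10 ranges, with `width=5.0` it gives 20, and with `width=2.5` it gives 40; the same index can select both the linear discriminants and the neural network model for a fragment.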
Once the models are trained, all possible complete and incomplete ORFs are extracted from the input fragment and their corresponding features are extracted using the same linear discriminant step used for training. Based on the GC-content of the fragment, the corresponding neural network model is used to score the ORF. The output of the neural network is the approximation of the posterior probability that the ORF is coding. Step 2 in Figure 2 illustrates the neural network model. Once all input ORFs are scored by the neural networks, the same greedy algorithm used by Orphelia is deployed to resolve the overlap between all candidate ORFs that have a probability greater than 0.5. Given the candidate list for a particular fragment containing all ORFs i with probability P_i > 0.5, Algorithm 1 describes the selection scheme used to generate the final list of genes. The maximum allowed overlap is o_max = 60 bp, which is also the minimal gene length considered for prediction. A more reasonable overlap would be 45 bp, which is believed to be the maximum overlap for bacterial genes. We use the same overlap as Orphelia to allow a direct comparison. However, the overlap is a parameter that the user can change.
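Algorithm 1 The final candidate selection
Input: candidate list C containing all ORFs i of the fragment with P_i > 0.5
Output: final gene list G (initially empty)
while C is nonempty do
    Find i_max = argmax_i P_i with respect to all ORFs i in C
    Move ORF i_max from C to G
    Remove all the ORFs in C that overlap with ORF i_max by more than o_max
end while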
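The greedy selection of Algorithm 1 can be sketched in Python as follows. The `ORF` record with half-open coordinates, the `overlap` helper and the function name `select_genes` are assumptions made for this illustration; only the selection logic (highest probability first, discard candidates overlapping the chosen ORF by more than the maximum allowed overlap) follows the algorithm.

```python
from typing import NamedTuple


class ORF(NamedTuple):
    start: int   # inclusive start on the fragment (hypothetical convention)
    end: int     # exclusive end on the fragment
    prob: float  # network output, the probability that the ORF is coding


def overlap(a: ORF, b: ORF) -> int:
    """Number of positions shared by two ORFs (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def select_genes(orfs, o_max: int = 60, threshold: float = 0.5):
    """Greedy overlap resolution in the spirit of Algorithm 1."""
    candidates = [o for o in orfs if o.prob > threshold]
    genes = []
    while candidates:
        best = max(candidates, key=lambda o: o.prob)
        genes.append(best)
        # Drop the chosen ORF and everything overlapping it by more than o_max.
        candidates = [
            o for o in candidates
            if o is not best and overlap(o, best) <= o_max
        ]
    return genes
```

Because `o_max` is an ordinary argument, a stricter value such as 45 bp can be tried without changing the procedure.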
In the regularized linear discriminants for the amino-acid features, X_MA and X_DA represent the monoamino-acid and diamino-acid usage respectively, λ is the regularization parameter, and y_M and y_D are the label vectors for the data points in X_MA and X_DA respectively (each label indicates whether sequence i is a positive example or a negative example). The linear discriminants for codon features are computed similarly. The monoamino-acid and diamino-acid features are then obtained simply as the projections w_MA · x_MA and w_DA · x_DA respectively.
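A standard form that fits this description is regularized least squares (ridge regression). As a sketch under that assumption, with the training examples stored as the columns of the data matrices and I denoting the identity matrix, the weight vectors and the resulting scalar features for an ORF with usage vectors x_MA and x_DA could be written as:

```latex
\begin{align*}
\mathbf{w}_{MA} &= \left( X_{MA} X_{MA}^{\top} + \lambda I \right)^{-1} X_{MA}\, \mathbf{y}_{M},
&
\mathbf{w}_{DA} &= \left( X_{DA} X_{DA}^{\top} + \lambda I \right)^{-1} X_{DA}\, \mathbf{y}_{D}, \\
\text{monoamino-acid feature} &= \mathbf{w}_{MA} \cdot \mathbf{x}_{MA},
&
\text{diamino-acid feature} &= \mathbf{w}_{DA} \cdot \mathbf{x}_{DA}.
\end{align*}
```

The column convention and the exact placement of λ are assumptions of this sketch.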
The resulting nine features for all the training examples in each GC range are combined in a non-linear fashion using a neural network. The output of each network is the posterior probability of an ORF encoding a protein. We use a standard multilayer perceptron to train the MGC models. This is similar to Orphelia, with the exception that we have two more features and that we train models that are GC range specific.
The parameters of the hidden layer are the input weight vectors and the bias parameters.
Here z is a vector containing all the z_i vectors.
The regularization matrix A = diag(a_1, ..., a_1, a_2, ..., a_2, a_3, ..., a_3, a_4) requires four strictly positive hyperparameters a_1, a_2, a_3, a_4 for separate scaling of the input weights, the input biases, the output weights w_o and the output bias b_o. Hoff et al. use the evidence framework for the adaptation of hyperparameters. This framework was introduced by MacKay and is based on a Gaussian approximation of the posterior distribution of network weights. This evidence-based adaptation of the hyperparameters is incorporated into the network training and uses the same training points.
In order to minimize the objective function in equation 5, a scaled conjugate gradient scheme is used as implemented in the NETLAB toolbox. The hyperparameters are all initially set to 0.001 and the weight and bias parameters are randomly initialized based on a standard normal distribution. The training scheme is iterated 50 times, where each iteration consists of 50 gradient steps followed by two hyperparameter adaptation steps.
For example, if we consider the 10% GC ranges, MGC computes 10 models, one from the training sequences of each GC range. Let θ_j, where j ∈ {1, ..., 10}, denote the resulting neural network model for a given GC range j. Training the model θ_j is similar to training the single model θ as described above, using only the training examples that have GC-content within the GC range j. The network output for a given test sample x_i is computed as f(x_i; θ_j) = P_i, where the GC-content of the fragment that contains x_i is within the GC range j.
Results and discussion
The performance of MGC is measured using the sensitivity and specificity measures, which evaluate the capability of detecting annotated genes and the reliability of the gene predictions respectively. The performance measures are computed for predicted genes in fragments with length 700 bp from 10 random replications of 10 bacterial and 3 archaeal genomes based on their GenBank annotations.
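As an illustration of these two measures, the sketch below uses the definitions that are common in gene prediction: sensitivity as the fraction of annotated genes that are predicted, and specificity as the fraction of predictions that match an annotated gene. These formulas and the function name `sensitivity_specificity` are assumptions of this example, and how a prediction is matched to an annotation is left outside the sketch.

```python
def sensitivity_specificity(tp: int, fp: int, fn: int):
    """Return (sensitivity, specificity) from gene-level counts.

    Assumed definitions, common in gene finding:
      sensitivity = TP / (TP + FN)  (annotated genes that were found)
      specificity = TP / (TP + FP)  (predictions that are annotated genes)
    """
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("counts do not allow both measures to be computed")
    return tp / (tp + fn), tp / (tp + fp)
```

For instance, 90 correct predictions, 10 false predictions and 30 missed genes would give a sensitivity of 0.75 and a specificity of 0.9.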
The accuracy of TIS was measured using the TIS correctness measure in equation 9. This measure is used because the traditional sensitivity and specificity measures are not suitable for measuring TIS prediction performance since they measure the performance of the gene prediction rather than the TIS accuracy.
MGC performance by GC ranges.
MGC versus Orphelia and FragGeneScan.
TIS accuracy comparison between MGC and Orphelia.
64.12 ± 0.84    51.03 ± 0.85
66.26 ± 0.39    51.10 ± 0.60
65.47 ± 0.22    58.85 ± 0.25
84.15 ± 0.70    65.30 ± 1.44
72.06 ± 0.81    63.41 ± 0.94
71.46 ± 0.35    59.43 ± 0.63
72.74 ± 0.27    64.24 ± 0.35
68.55 ± 0.71    60.37 ± 0.71
68.86 ± 0.42    61.09 ± 0.35
69.49 ± 0.65    53.93 ± 0.71
67.85 ± 0.46    56.11 ± 0.69
71.33 ± 0.69    60.29 ± 0.57
71.18 ± 0.37    68.32 ± 0.38
The TIS correctness measure is computed from the subset of the predicted genes that have an annotated TIS. Direct comparison of the two methods based on this measure is difficult since they predict a different number of genes. Nonetheless, we notice that the improvement of the TIS correctness is comparable to that of the sensitivity and specificity measures. Specifically, we observe that the average TIS correctness of our algorithm is 11.39% higher than that of Orphelia.
The results show that MGC improves on the performance of Orphelia. We hypothesized that learning separate models for several pre-defined GC-content regions, as opposed to the single model approach used by Orphelia, would improve the performance of the neural network. The current results support this hypothesis. The 5% GC range models improve on the 2.5% GC range models by around 1% on average. The 10% GC range models also exhibit a slight improvement over the 5% GC range models; however, this difference is within the standard deviation. Thus, there is no need to investigate models for smaller GC ranges to prove the benefit of having multiple models versus a single model. However, it would be useful to compute other models based on larger GC ranges in order to investigate and find better partitions of the GC spectrum.
MGC outperforms Orphelia in TIS prediction accuracy. Evaluating TIS recognition is hampered by the fact that we must rely on published annotations, many of which are generated automatically and have not been fully verified. This is a well recognized problem in traditional gene annotation.
The Orphelia algorithm was tested on different fragment sizes by building models for fragments ranging from 200 bp to 500 bp with increments of 20 bp. Hoff et al. recommend using the 700 bp model for all fragments greater than 300 bp, while fragments ranging from 200 bp to 300 bp should be run using the 300 bp model. According to a recent metagenomics survey by Thomas et al., the 454/Roche and the Illumina/Solexa systems are the most commonly used systems. While the Illumina/Solexa system produces shorter reads, the average read length for 454/Roche technology ranges between 600 and 800 bp. MGC's 700 bp models are sufficient for longer reads such as 454/Roche reads. We are currently developing 300 bp models in order to handle shorter reads such as those from the Illumina/Solexa system.
In this paper we show that learning separate models for several pre-defined GC-content regions, as opposed to the single model approach used by Orphelia, leads to an improvement in performance. We also show that amino-acid usage helps to improve the overall accuracy of the gene finder. In the future, we plan to evaluate models based on different GC ranges. We also plan to use ensemble techniques to combine the ORF probabilities from overlapping models in order to improve the predictions of MGC. This plan is based on the observation of Hansen and Krogh that the error of an ensemble is the average error of the ensemble members minus a measure of the disagreement among the members. This suggests that the ensemble is never worse than the average performance of its individual members.
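The relationship between ensemble error, average member error and disagreement can be checked numerically. The sketch below assumes squared error and a uniformly weighted average of the member outputs, which is one common setting in which this decomposition holds exactly; the function name `ambiguity_decomposition` is hypothetical.

```python
def ambiguity_decomposition(predictions, target: float):
    """Return (ensemble_error, mean_member_error, disagreement).

    Assumptions: squared error and a uniformly weighted ensemble. Under
    these assumptions ensemble_error == mean_member_error - disagreement.
    """
    n = len(predictions)
    if n == 0:
        raise ValueError("at least one prediction is required")
    ensemble = sum(predictions) / n
    ensemble_error = (ensemble - target) ** 2
    mean_member_error = sum((p - target) ** 2 for p in predictions) / n
    disagreement = sum((p - ensemble) ** 2 for p in predictions) / n
    return ensemble_error, mean_member_error, disagreement
```

Since the disagreement term is non-negative, the ensemble error cannot exceed the average member error in this setting, and the gain grows as the members disagree more.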
In our experiments, we have used simulated data derived from fully sequenced genomes. We plan to study the effect of sequencing errors on the prediction performance by simulating data with different error rates. Three types of errors can occur in all sequencing techniques: substitution, insertion, and deletion of one or more nucleotides during the reading process. Since we rely on codon and amino acid features to predict genes, any insertion or deletion will shift the frame of the sequence and thus alter the codon and amino acid compositions. In addition to evaluating MGC's prediction ability on sequences with these types of errors, we need to develop a way to compensate for the frame shifts; otherwise we will not be able to classify erroneous fragments. FragGeneScan currently shows the best performance for reads with errors. Once we address error modeling in MGC, we plan to compare our results with those of FragGeneScan.
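The effect of a single insertion on codon-based features can be seen in a toy example. The helper names `codons` and `insert_base` are hypothetical and the sequence is made up for illustration.

```python
def codons(seq: str, frame: int = 0) -> list:
    """Split a sequence into complete codons starting at `frame`."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def insert_base(seq: str, pos: int, base: str) -> str:
    """Simulate an insertion error by adding `base` before index `pos`."""
    return seq[:pos] + base + seq[pos:]


original = "ATGGCCAAATTT"
mutated = insert_base(original, 3, "T")  # one extra T after the start codon

# Every codon downstream of the insertion is read in a shifted frame.
original_codons = codons(original)
mutated_codons = codons(mutated)
```

Here the original reads as ATG GCC AAA TTT, while the sequence with one inserted base reads as ATG TGC CAA ATT, so all codon and amino acid counts after the error position change.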
This work is supported by the National Science Foundation under Grant No. DBI-0959427.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 9, 2013: Selected articles from the 8th International Symposium on Bioinformatics Research and Applications (ISBRA'12). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S9.
Publication of this article was funded by the authors.
- Hoff KJ, Lingner T, Meinicke P, Tech M: Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Research. 2009, 37 (Web Server): W101-5. 10.1093/nar/gkp327. [http://www.ncbi.nlm.nih.gov/pubmed/19429689]
- Allali AE, Rose JR: MIM: A Species Independent Approach for Classifying Coding and Non-Coding DNA Sequences in Bacterial and Archaeal Genomes. Engineering and Technology. 2010, 411-418.
- Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Research. 2008, 18 (2): 324-330. 10.1101/gr.7088808.
- Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research. 2008, 18 (5): 810-820. 10.1101/gr.7337908. [http://www.ncbi.nlm.nih.gov/pubmed/18340039]
- Noguchi H, Park J, Takagi T: MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Research. 2006, 34 (19): 5623-5630. 10.1093/nar/gkl723. [http://www.ncbi.nlm.nih.gov/pubmed/17028096]
- Borodovsky M, Mills R, Besemer J, Lomsadze A: Prokaryotic gene prediction using GeneMark and GeneMark.hmm. Current Protocols in Bioinformatics. 2003, Chapter 4: Unit 4.5. [http://www.ncbi.nlm.nih.gov/pubmed/18428700]
- Rho M, Tang H, Ye Y: FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Research. 2010, 38 (20): e191. 10.1093/nar/gkq747.
- Chan PK, Stolfo SJ: A comparative evaluation of voting and meta-learning on partitioned data. Proc 12th International Conference on Machine Learning. 1995, Morgan Kaufmann, 90-98. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.7713]
- Singer GA, Hickey DA: Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Molecular Biology and Evolution. 2000, 17 (11): 1581-1588. 10.1093/oxfordjournals.molbev.a026257. [http://www.ncbi.nlm.nih.gov/pubmed/11070046]
- Lightfield J, Fram NR, Ely B: Across Bacterial Phyla, Distantly-Related Genomes with Similar Genomic GC Content Have Similar Patterns of Amino Acid Usage. PLoS ONE. 2011, 6 (3): 12.
- Oliver JL, Marín A: A relationship between GC content and coding-sequence length. Journal of Molecular Evolution. 1996, 43 (3): 216-223. 10.1007/BF02338829.
- Misawa K, Kikuno RF: Relationship between amino acid composition and gene expression in the mouse genome. BMC Research Notes. 2011, 4: 20. 10.1186/1756-0500-4-20.
- Wilkins MR, Pasquali C, Appel RD, Ou K, Golaz O, Sanchez JC, Yan JX, Gooley AA, Hughes G, Humphery-Smith I, Williams KL, Hochstrasser DF: From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology NY. 1996, 14: 61-65. 10.1038/nbt0196-61.
- Hobohm U, Sander C: A sequence property approach to searching protein databases. Journal of Molecular Biology. 1995, 251 (3): 390-399. 10.1006/jmbi.1995.0442.
- Guruprasad K, Reddy BV, Pandit MW: Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Engineering. 1990, 4 (2): 155-161. 10.1093/protein/4.2.155.
- Merkl R: A survey of codon and amino acid frequency bias in microbial genomes focusing on translational efficiency. Journal of Molecular Evolution. 2003, 57 (4): 453-466. 10.1007/s00239-003-2499-1.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank: update. Nucleic Acids Research. 2004, 32 (Database): D23-D26. [http://www.ncbi.nlm.nih.gov/pubmed/14681350]
- Hoff KJ, Tech M, Lingner T, Daniel R, Morgenstern B, Meinicke P: Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics. 2008, 9: 217. 10.1186/1471-2105-9-217. [http://www.ncbi.nlm.nih.gov/pubmed/18442389]
- MacKay DJC: A Practical Bayesian Framework for Backpropagation Networks. Neural Computation. 1992, 4 (3): 448-472. 10.1162/neco.1992.4.3.448. [http://www.mitpressjournals.org/doi/abs/10.1162/neco.1992.4.3.448]
- Nabney I: NETLAB: algorithms for pattern recognition. 2002, Springer, [http://www.springer.com/computer/ai/book/978-1-85233-440-6]
- Thomas T, Gilbert J, Meyer F: Metagenomics - a guide from sampling to data analysis. Microbial Informatics and Experimentation. 2012, 2: 3. 10.1186/2042-5783-2-3. [http://www.microbialinformaticsj.com/content/2/1/3]
- Hansen JV, Krogh A: A general method for combining predictors tested on protein secondary structure prediction. Artificial Neural Networks in Medicine and Biology. 2000, 259-264. [http://link.springer.com/chapter/10.1007%2F978-1-4471-0513-8_39]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.