Next generation genome annotation with mGene.ngs
© Behr et al; licensee BioMed Central Ltd. 2010
Published: 07 December 2010
An increasingly large number of novel genomes is being sequenced, and the task of automatic genome annotation has never been more important. The current revolution in sequencing technologies also allows us to obtain a detailed picture of the whole complement of expressed RNA transcripts. We have developed a novel de novo gene finding system, mGene.ngs, that combines the benefits of accurate ab initio gene finding with the rich information obtained in RNA sequencing (RNA-seq) experiments.
The system is based on the recently developed gene finding system mGene, which employs state-of-the-art prediction techniques and has been shown to perform very well compared to established gene finding systems. In contrast to many HMM-based gene finders, mGene has the conceptual advantage of being very flexible in terms of incorporating heterogeneous input data. The employed inference techniques can exploit the transcriptome information already at the learning stage and thereby adapt to the relevance of the different sources of evidence. We show that these advantages can be translated into more accurate gene predictions. Moreover, we developed extensions of mGene.ngs to predict and quantify alternative RNA transcripts.
To provide de novo genome annotations based on RNA-seq experiments, we first construct a preliminary, highly specific gene set for genes that are well covered by RNA-seq reads. In a second step, we train predictors for genomic signals on the preliminary gene set. In a third step, we train mGene.ngs using the preliminary gene models while taking advantage of the RNA-seq read coverage and the genomic signal predictions.
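As a rough illustration of the first step, the following sketch selects candidate gene models whose exons are well covered by reads. It is not the mGene.ngs implementation: the function names, the data layout, and the thresholds `min_depth` and `min_fraction` are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # half-open interval [start, end), 0-based coordinates


def covered_fraction(exons: List[Exon], coverage: List[int], min_depth: int = 1) -> float:
    """Fraction of exonic positions with read depth >= min_depth."""
    total = 0
    covered = 0
    for start, end in exons:
        for pos in range(start, end):
            total += 1
            if coverage[pos] >= min_depth:
                covered += 1
    return covered / total if total else 0.0


def select_well_covered(genes: Dict[str, List[Exon]], coverage: List[int],
                        min_depth: int = 1, min_fraction: float = 0.9) -> List[str]:
    """Names of genes whose exonic coverage reaches the (hypothetical) threshold."""
    return [name for name, exons in genes.items()
            if covered_fraction(exons, coverage, min_depth) >= min_fraction]


# Toy example: per-base read depth on a 20 bp contig
coverage = [0, 0, 5, 6, 7, 7, 0, 0, 0, 4, 4, 4, 0, 0, 0, 0, 1, 0, 0, 0]
genes = {
    "geneA": [(2, 6), (9, 12)],  # every exonic base is covered
    "geneB": [(14, 18)],         # only one of four exonic bases is covered
}
selected = select_well_covered(genes, coverage)
```

In this toy input only `geneA` passes the filter. A strict threshold of this kind trades sensitivity for specificity, which matches the stated goal of a highly specific preliminary gene set.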
Investigating the contribution of individual features, we found that spliced read alignments suggesting introns contribute most to the gain in gene prediction performance: 91.6% of the total improvement achieved is due to spliced read alignments. The read coverage alone is much less informative and only leads to improvements similar to those achieved with transcriptome tiling arrays. We applied the developed annotation strategy to the re-annotation of the C. briggsae genome, for which only few transcriptome sequences are available so far. We show that the new annotation is considerably more accurate than previous ones and additionally includes alternative RNA isoforms.
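To make concrete how a spliced read alignment suggests an intron, the sketch below extracts intron intervals from alignments given in SAM-style CIGAR notation, where an `N` operation marks a skipped reference region. It assumes 0-based leftmost positions and a list of `(position, CIGAR)` pairs as input; the function name and the support-counting step are illustrative assumptions, not the feature extraction used in mGene.ngs.

```python
import re
from collections import Counter
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_alignment(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return introns as half-open reference intervals implied by 'N' operations.

    pos is the 0-based leftmost reference position of the alignment.
    """
    introns = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            introns.append((ref, ref + n))
        if op in "MDN=X":  # operations that consume reference bases
            ref += n
    return introns


# Toy alignments: two spliced reads spanning the same intron, one unspliced read
alignments = [(100, "20M50N16M"), (105, "15M50N21M"), (300, "36M")]

# Number of reads supporting each candidate intron
support = Counter(
    intron
    for pos, cigar in alignments
    for intron in introns_from_alignment(pos, cigar)
)
```

Both spliced reads imply the same intron at `(120, 170)`, so its support count is 2, while the unspliced read contributes only to coverage. This shows why spliced alignments pin down exact intron boundaries, whereas read coverage alone gives a less precise signal.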
mGene.ngs will be released as open source software at http://mgene.org and is already available as a Galaxy-based web service at http://galaxy.fml.mpg.de.
- Schweikert et al.: mGene: Accurate SVM-based gene finding. Genome Research 2009, 19: 2133–2143. doi:10.1101/gr.090597.108
- Coghlan et al.: nGASP: The nematode genome annotation assessment project. BMC Bioinformatics 2008, 9: 549. doi:10.1186/1471-2105-9-549
- Trapnell et al.: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 2010. doi:10.1038/nbt.1621
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.