Next generation genome annotation with mGene.ngs

Behr, Jonas; Bohnert, Regina; Zeller, Georg; Schweikert, Gabriele; Hartmann, Lisa; Rätsch, Gunnar

doi:10.1186/1471-2105-11-S10-O8

Volume 11 Supplement 10

Highlights from the Sixth International Society for Computational Biology (ISCB) Student Council Symposium

Oral presentation
Open access
Published: 07 December 2010

BMC Bioinformatics volume 11, Article number: O8 (2010)

An increasingly large number of novel genomes are being sequenced, and the task of automatic genome annotation has never been more important. The current revolution in sequencing technologies also allows us to obtain a detailed picture of the whole complement of expressed RNA transcripts. We have developed a novel de novo gene finding system, mGene.ngs, that combines the benefits of accurate ab initio gene finding with the rich information obtained in RNA sequencing (RNA-seq) experiments.

The system is based on the recently developed gene finding system mGene [1], which employs state-of-the-art prediction techniques and has been shown to perform very well compared to established gene finding systems [2]. In contrast to many HMM-based gene finders, mGene has the conceptual advantage of being very flexible in incorporating heterogeneous input data. The inference techniques it employs can exploit the transcriptome information as early as the learning stage and so adapt appropriately to the relevance of the different sources of evidence. We show that these advantages translate into more accurate gene predictions. Moreover, we developed extensions of mGene.ngs to predict and quantify alternative RNA transcripts.

To provide de novo genome annotations based on RNA-seq experiments, we first construct a preliminary, highly specific gene set for genes that are well covered by RNA-seq reads. In a second step, we train predictors for genomic signals on the preliminary gene set. In a third step, we train mGene.ngs using the preliminary gene models while taking advantage of the RNA-seq read coverage and genomic signal predictions.

We illustrate the power of our approach on the C. elegans genome using 50M paired-end RNA-seq reads (Illumina; 76 nt). Figure 1 shows transcript-level evaluation results for all annotated genes (WS200) as a function of expression level. The ab initio mGene-based system (blue), trained on the annotation, achieves an average transcript-level F-score of 49.9%. The de novo annotation system (green), which uses RNA-seq reads but does not consider the existing genome annotation, performs slightly better (51.8%). If we use the RNA-seq reads and also train on the existing annotation (red), we achieve 57.6%, showing that the system can take advantage of the previous annotation. We find it remarkable that, for medium to highly expressed genes, the de novo gene predictions are as similar to the genome annotation as the predictions of the system that has seen parts of the annotation in training. Comparing these results to predictions from the recently published method Cufflinks [3] (black) suggests that Cufflinks is not able to adapt appropriately to the RNA-seq data at hand.
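For readers unfamiliar with the metric, a transcript-level F-score is the harmonic mean of sensitivity and precision over transcripts. The following is a minimal sketch, not the evaluation code behind Figure 1: representing a transcript as a chromosome, strand and tuple of exon intervals, and counting a prediction as correct only on an exact match, are assumptions made for illustration, and the function name is hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical representation: (chromosome, strand, ((start, end), ...)).
# Two transcripts match only if all three components agree exactly.
Exon = Tuple[int, int]
Transcript = Tuple[str, str, Tuple[Exon, ...]]


def transcript_f_score(predicted: Iterable[Transcript],
                       annotated: Iterable[Transcript]) -> float:
    """Harmonic mean of transcript-level sensitivity and precision."""
    pred = set(predicted)
    anno = set(annotated)
    if not pred or not anno:
        return 0.0
    correct = len(pred & anno)          # exactly matching transcripts
    sensitivity = correct / len(anno)   # fraction of annotation recovered
    precision = correct / len(pred)     # fraction of predictions correct
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)
```

Under this definition, a system must predict every exon boundary of a transcript correctly to receive credit, which is why transcript-level scores are typically much lower than exon-level or nucleotide-level scores.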

Investigating the contribution of individual features, we found that spliced read alignments suggesting introns contribute most to the gain in gene prediction performance: 91.6% of the total improvement achieved is due to spliced read alignments. Read coverage alone is much less informative and leads only to improvements similar to those achieved with transcriptome tiling arrays. We applied the annotation strategy to the re-annotation of the C. briggsae genome, for which only few transcriptome sequences are available so far. We can show that the new annotation is considerably more accurate than previous ones and additionally includes alternative RNA isoforms.
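To make the notion of intron evidence from spliced alignments concrete, the sketch below collects candidate introns and the number of reads supporting each. It is not the mGene.ngs feature extraction: it assumes alignments are given as SAM-style records (1-based leftmost position plus a CIGAR string, where the `N` operation marks a skipped reference region), and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_cigar(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive intron intervals implied by N operations.

    pos is the 1-based leftmost reference position of the alignment.
    """
    introns = []
    ref = pos
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            introns.append((ref, ref + n - 1))
        # Only these operations advance along the reference sequence.
        if op in "MDN=X":
            ref += n
    return introns


def intron_support(alignments: Iterable[Tuple[str, int, str]]) -> Counter:
    """Count, per (chromosome, start, end) intron, the supporting reads.

    Each alignment is assumed to be (chromosome, pos, cigar).
    """
    support = Counter()
    for chrom, pos, cigar in alignments:
        for start, end in introns_from_cigar(pos, cigar):
            support[(chrom, start, end)] += 1
    return support
```

For example, a 76 nt read aligned at position 100 with CIGAR `30M200N46M` implies an intron spanning positions 130 to 329. Counts of this kind could serve as one input feature to a gene finder, whereas per-base read coverage carries no direct information about splice junctions.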

mGene.ngs will be released as open source software on http://mgene.org and is already available as a Galaxy-based web service at http://galaxy.fml.mpg.de.

References

[1] Schweikert et al.: mGene: Accurate SVM-based gene finding. Genome Research 2009, 19: 2133–2143. doi:10.1101/gr.090597.108
[2] Coghlan et al.: nGASP: The nematode genome annotation assessment project. BMC Bioinformatics 2008, 9: 549. doi:10.1186/1471-2105-9-549
[3] Trapnell et al.: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 2010. doi:10.1038/nbt.1621

Author information

Authors and Affiliations

Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany
Jonas Behr, Regina Bohnert, Georg Zeller, Gabriele Schweikert, Lisa Hartmann & Gunnar Rätsch
Max Planck Institute for Developmental Biology, Tübingen, Germany
Georg Zeller & Gabriele Schweikert
Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Gabriele Schweikert

Corresponding author

Correspondence to Jonas Behr.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Behr, J., Bohnert, R., Zeller, G. et al. Next generation genome annotation with mGene.ngs. BMC Bioinformatics 11 (Suppl 10), O8 (2010). https://doi.org/10.1186/1471-2105-11-S10-O8
