Validation and quality assurance for genome browser database exports
BMC Bioinformatics volume 16, Article number: P13 (2015)
FPD2GB2 (Fungal Project Database to GBrowse 2) is a genome browser transition utility designed in our lab. It exports data from a custom database used by the Fungal Endophytes Genome Project [1, 2]. Implemented as a collection of scripts, FPD2GB2 writes the contents of this locally developed genome annotation database to the standard GFF3 format, which allows bulk import of the data into the GBrowse 2 genome browser.
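As an illustration of the export target, the following minimal sketch serializes one annotation record as a GFF3 line. It is not the FPD2GB2 code: the record layout (a dictionary with `seqid`, `type`, `start`, `end`, `strand`, `id` and `parent` keys) and the function names are assumptions made for this example, since the schema of the Fungal Project Database is not described here. Only the nine tab-separated GFF3 columns and the escaping of reserved characters in the attribute column follow the GFF3 specification.

```python
# Hypothetical sketch: the record keys below are assumed, not the real
# Fungal Project Database schema.

# Characters that must be percent-encoded inside GFF3 attribute values.
GFF3_RESERVED = {
    "%": "%25", ";": "%3B", "=": "%3D", "&": "%26",
    ",": "%2C", "\t": "%09", "\n": "%0A",
}


def escape_attr(value):
    """Percent-encode the characters GFF3 reserves in column 9."""
    return "".join(GFF3_RESERVED.get(ch, ch) for ch in str(value))


def to_gff3_line(rec):
    """Format one record as a nine-column, tab-separated GFF3 line.

    Coordinates are written as given and are expected to be 1-based and
    inclusive, as GFF3 requires. Missing optional fields become ".".
    """
    attrs = [("ID", rec["id"])]
    if rec.get("parent"):
        attrs.append(("Parent", rec["parent"]))
    attr_text = ";".join(f"{key}={escape_attr(val)}" for key, val in attrs)
    columns = [
        rec["seqid"],
        rec.get("source", "."),
        rec["type"],
        str(rec["start"]),
        str(rec["end"]),
        str(rec.get("score", ".")),
        rec.get("strand", "."),
        str(rec.get("phase", ".")),  # CDS features need a real phase (0, 1 or 2)
        attr_text,
    ]
    return "\t".join(columns)


if __name__ == "__main__":
    print("##gff-version 3")
    print(to_gff3_line({"seqid": "contig_1", "type": "gene", "start": 100,
                        "end": 500, "strand": "+", "id": "gene_1"}))
```

Keeping the escaping in one small function means a validator can reverse it with ordinary URL decoding when it reads the exported file back.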
Materials and methods
Any application that converts between data formats should ensure the completeness and accuracy of its output. Adding a data validator to the FPD2GB2 script collection allows independent verification of the quality and soundness of the GFF3 files that are imported into a production GBrowse 2 environment.
We measure the accuracy of the output by comparing the features listed in the GFF3 files to the contents of the original database. Accuracy is validated by checking that the offsets of features relative to their reference features are correct. Completeness is validated by comparing the parent-child inheritance structure of features in the output to that of the source data. The script collection is structured into a “master” script and several “worker” scripts, each of which produces its own output. The structure of the collection is shown in Figure 1. The goals and methods for the validator are described in Table 1.
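The two checks described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the validator shipped with FPD2GB2: the `Feature` tuple, the `parse_gff3` and `validate` functions, and the report keys are all hypothetical names, and the sketch assumes that every feature has a unique `ID` attribute and at most one `Parent`, which GFF3 itself does not require. The source records are assumed to have already been loaded from the database into the same structure.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class Feature(NamedTuple):
    """Hypothetical minimal feature record shared by both sides."""
    seqid: str
    type: str
    start: int
    end: int
    strand: str
    parent: Optional[str]


def parse_gff3(text):
    """Parse GFF3 text into {feature_id: Feature}.

    Assumes unique IDs and a single Parent per feature.
    """
    features = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, got {len(cols)}: {line!r}")
        attrs = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attrs[key] = unquote(value)
        features[attrs["ID"]] = Feature(
            cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6],
            attrs.get("Parent"),
        )
    return features


def validate(source, exported):
    """Compare exported features against the source records."""
    report = {
        "missing": sorted(source.keys() - exported.keys()),
        "unexpected": sorted(exported.keys() - source.keys()),
        "location_mismatch": [],
        "type_mismatch": [],
        "parent_mismatch": [],
    }
    for fid in sorted(source.keys() & exported.keys()):
        src, out = source[fid], exported[fid]
        # Accuracy: same reference sequence, coordinates and strand.
        if (src.seqid, src.start, src.end, src.strand) != (
                out.seqid, out.start, out.end, out.strand):
            report["location_mismatch"].append(fid)
        if src.type != out.type:
            report["type_mismatch"].append(fid)
        # Completeness of structure: same parent-child link.
        if src.parent != out.parent:
            report["parent_mismatch"].append(fid)
    return report
```

An export passes this sketch when every list in the report is empty. Features missing from the output and broken parent links are reported separately so that completeness problems can be told apart from coordinate errors.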
It is notoriously difficult to prove the accuracy of computational results, and in practice validation is based on testing. To validate completeness, correctness and accuracy, we use metrics that give confidence not only that the output accurately reflects the source data, but also that the algorithms used to create the output are correct. The size of some of the databases and the number of annotation tracks also make a full comparison of related tracks impractical, because comparing every track with every other takes a number of runs that is quadratic in the number of tracks. Finally, because the annotations lack metadata establishing relationships, comparisons using ParsEval have to be run manually.
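The quadratic growth can be made concrete with a short sketch. It assumes that a full comparison means one run per unordered pair of tracks, which gives n(n-1)/2 runs for n tracks. The track names are placeholders, and no comparison tool is actually invoked.

```python
from itertools import combinations
from math import comb


def comparison_runs(tracks):
    """Return every unordered pair of tracks, one pair per comparison run."""
    return list(combinations(tracks, 2))


if __name__ == "__main__":
    # Placeholder track names, for illustration only.
    tracks = ["track_a", "track_b", "track_c", "track_d"]
    runs = comparison_runs(tracks)
    print(len(runs))                    # 4 tracks -> 6 runs
    for n in (10, 50, 100):
        print(n, comb(n, 2))            # 45, 1225 and 4950 runs
```

Doubling the number of tracks roughly quadruples the number of runs, which is why exhaustive comparison quickly becomes impractical when each run must be started by hand.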
1. Fungal Endophytes Genome Project. [http://www.endophyte.uky.edu/]
2. Schardl CL, Young CA, Hesse U, Amyotte SG, Andreeva K, Calie PJ, et al: Plant-symbiotic fungi as chemical engineers: multi-genome analysis of the Clavicipitaceae reveals dynamics of alkaloid loci. PLoS Genetics. 2013, 9 (2): e1003323.
3. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, et al: The Generic Genome Browser: A Building Block for a Model Organism System Database. Genome Research. 2002, 12 (10): 1599-1610.
About this article
Cite this article
Chui, R., Jaromczyk, J.W., Moore, N. et al. Validation and quality assurance for genome browser database exports. BMC Bioinformatics 16 (Suppl 15), P13 (2015). https://doi.org/10.1186/1471-2105-16-S15-P13
- Genome Browser
- Fungal Endophyte
- Annotation Database
- Quadratic Number
- Independent Verification