Validation and quality assurance for genome browser database exports

Background

FPD2GB2 (Fungal Project Database to GBrowse 2) is a genome browser transition utility developed in our lab that exports data from a custom database used by the Fungal Endophytes Genome Project [1, 2]. It is a collection of scripts that writes the contents of this locally developed genome annotation database in the standard GFF3 format, allowing bulk import of the data into the GBrowse 2 genome browser [3].

Materials and methods

Any application that converts between data formats should ensure the completeness and accuracy of its output. Adding a data validator to the FPD2GB2 script collection allows independent verification of the quality and soundness of the GFF3 files that are imported into a production GBrowse 2 environment.

We measure the accuracy of the output by comparing the features listed in the GFF3 files with the contents of the original database, including checking that the offsets of features relative to their reference features are correct. We assess the completeness of the output by comparing the parent-child structure of its features with that of the source data. The script collection is structured into a “master” script and several “worker” scripts, each of which produces its own output. The structure of the collection is shown in Figure 1. The goals and methods for the validator are described in Table 1.
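The comparison described above can be illustrated with a minimal Python sketch. It is not the FPD2GB2 validator itself: the function names `parse_gff3` and `compare`, the sample records, and the representation of source database rows as a dictionary keyed by feature ID are all assumptions made for illustration, since the schema of the Fungal Project Database is not described here. The sketch also assumes that every feature carries a unique `ID` attribute, which GFF3 does not require.

```python
from urllib.parse import unquote


def parse_gff3(text):
    """Return {feature_id: (seqid, start, end, parent_ids)} for features with an ID."""
    features = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 tab-separated columns: {line!r}")
        seqid, start, end, attrs = cols[0], cols[3], cols[4], cols[8]
        attributes = {}
        for pair in attrs.split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attributes[key] = value
        if "ID" not in attributes:
            continue  # this sketch only tracks features that have an ID
        parents = frozenset(
            unquote(p) for p in attributes.get("Parent", "").split(",") if p
        )
        features[unquote(attributes["ID"])] = (seqid, int(start), int(end), parents)
    return features


def compare(source, exported):
    """Report features that differ between source records and exported GFF3."""
    shared = set(source) & set(exported)
    return {
        # completeness: every source feature should appear in the export
        "missing": sorted(set(source) - set(exported)),
        "unexpected": sorted(set(exported) - set(source)),
        # accuracy: reference sequence and offsets should agree
        "coordinate_mismatches": sorted(
            f for f in shared if source[f][:3] != exported[f][:3]
        ),
        # structure: parent-child links should agree
        "parent_mismatches": sorted(
            f for f in shared if source[f][3] != exported[f][3]
        ),
    }


# Hypothetical export and source records, for illustration only.
GFF3_SAMPLE = "\n".join([
    "##gff-version 3",
    "contig1\tFPD\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "contig1\tFPD\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "contig1\tFPD\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mrna1",
])
SOURCE_SAMPLE = {
    "gene1": ("contig1", 100, 900, frozenset()),
    "mrna1": ("contig1", 100, 900, frozenset({"gene1"})),
    "exon1": ("contig1", 101, 400, frozenset({"mrna1"})),  # start differs by one
    "exon2": ("contig1", 500, 900, frozenset({"mrna1"})),  # absent from the export
}

report = compare(SOURCE_SAMPLE, parse_gff3(GFF3_SAMPLE))
print(report)
```

With these sample records, the report flags `exon2` as missing and `exon1` as a coordinate mismatch, corresponding to the completeness and accuracy checks respectively.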

Figure 1 Block diagram of FPD2GB2 data flow and execution.

Table 1 Goals and methods for the validator.

Results

It is notoriously difficult to prove the accuracy of computational results, so in practice validation is based on testing. To validate completeness, correctness, and accuracy, we use metrics that give confidence not only that the output accurately reflects the source data, but also that the algorithms used to create the output are correct. The size of some of the databases and the number of annotation tracks make a full comparison of related tracks impractical, because comparing every pair of tracks requires a number of runs that is quadratic in the number of tracks. Finally, because the annotations carry no metadata establishing relationships between them, comparisons using ParsEval have to be run manually.
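To make the quadratic growth concrete, the short Python sketch below enumerates the pairwise comparisons for a set of tracks. It assumes that each unordered pair of tracks is compared exactly once, which gives n(n-1)/2 runs for n tracks; the track names and the helper functions `comparison_pairs` and `comparison_count` are hypothetical.

```python
from itertools import combinations


def comparison_pairs(tracks):
    """List every unordered pair of tracks that a full comparison would cover."""
    return list(combinations(tracks, 2))


def comparison_count(n_tracks):
    """Number of pairwise runs for n tracks: n * (n - 1) / 2."""
    return n_tracks * (n_tracks - 1) // 2


# Hypothetical track names, for illustration only.
tracks = ["track_a", "track_b", "track_c", "track_d"]
for left, right in comparison_pairs(tracks):
    print(f"compare {left} with {right}")

print(comparison_count(len(tracks)))
print(comparison_count(50))
```

Four tracks need 6 runs, while 50 tracks would need 1225, which is why exhaustive comparison quickly becomes impractical when each run has to be started by hand.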

References

  1. Fungal Endophytes Genome Project: [http://www.endophyte.uky.edu/]

  2. Schardl CL, Young CA, Hesse U, Amyotte SG, Andreeva K, Calie PJ, et al: Plant-symbiotic fungi as chemical engineers: multi-genome analysis of the Clavicipitaceae reveals dynamics of alkaloid loci. PLoS Genetics. 2013, 9 (2): e1003323.

  3. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, et al: The Generic Genome Browser: A Building Block for a Model Organism System Database. Genome Research. 2002, 12 (10): 1599-1610.


Author information

Corresponding author

Correspondence to Jerzy W Jaromczyk.

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Cite this article

Chui, R., Jaromczyk, J.W., Moore, N. et al. Validation and quality assurance for genome browser database exports. BMC Bioinformatics 16, P13 (2015). https://doi.org/10.1186/1471-2105-16-S15-P13

  • DOI: https://doi.org/10.1186/1471-2105-16-S15-P13

Keywords

  • Genome Browser
  • Fungal Endophyte
  • Annotation Database
  • Quadratic Number
  • Independent Verification