Figure 1 | BMC Bioinformatics

Figure 1

From: Mining locus tags in PubMed Central to improve microbial gene annotation

Flowchart for mining locus tags using the pmcXML package. R functions are indicated by solid lines, inputs by dashed lines, and NCBI databases and R objects by boxes. For each species, NCBI Genomes is used to find the reference strain and download the GFF3 file. The locus tag prefixes in the GFF3 files are used to format a search query in PubMed Central and find matching references. For each reference, the PMC ID is used to download the XML document, which is then parsed into full text and tables. The XML file includes links to supplements that are downloaded separately, but these typically require additional code to reformat; therefore, locus tags were extracted only from the supplements for B. pseudomallei. Finally, the locus tags are used to create a pattern string that extracts tags from the text and expands locus tag pairs marking the start and end of a region. The R functions developed specifically for this effort are described in Additional file 2.
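The last step of the flowchart (building a pattern string, extracting tags, and expanding tag pairs into regions) can be illustrated with a minimal sketch. This is written in Python for illustration and is not the pmcXML R code. The function names, the accepted range separators (hyphen, en dash, "to"), and the sample sentence are all assumptions; the prefixes BPSL and BPSS are used only as example B. pseudomallei prefixes.

```python
import re


def build_tag_pattern(prefixes):
    """Compile a regex matching any given prefix followed by digits.

    Hypothetical helper: assumes tags look like PREFIX1234 or PREFIX_1234.
    """
    # Longest prefixes first so a short prefix cannot shadow a longer one.
    ordered = sorted(prefixes, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    return re.compile(rf"\b(?:{alternatives})_?\d+\b")


def expand_tag_pair(start, end):
    """Expand a start/end tag pair into every tag in the region, inclusive."""
    m_start = re.fullmatch(r"(.*?)(\d+)", start)
    m_end = re.fullmatch(r"(.*?)(\d+)", end)
    if not m_start or not m_end or m_start.group(1) != m_end.group(1):
        raise ValueError(f"tags do not share a prefix: {start}, {end}")
    first, last = int(m_start.group(2)), int(m_end.group(2))
    if last < first:
        raise ValueError(f"end tag precedes start tag: {start}, {end}")
    prefix = m_start.group(1)
    width = len(m_start.group(2))  # keep the zero padding of the start tag
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


def extract_locus_tags(text, prefixes):
    """Return the sorted, unique tags in text, with tag pairs expanded."""
    tag = build_tag_pattern(prefixes)
    # Assumed separators between the two tags of a pair: "-", en dash, "to".
    pair = re.compile(rf"({tag.pattern})\s*(?:-|\u2013|to)\s*({tag.pattern})")
    found = set(tag.findall(text))
    for start, end in pair.findall(text):
        try:
            found.update(expand_tag_pair(start, end))
        except ValueError:
            # Not a usable region (mixed prefixes or reversed order); the two
            # tags are still kept as individual matches.
            pass
    return sorted(found)


# Made-up sentence for illustration only.
sample = "Genes BPSL0098-BPSL0101 and BPSS1498 were deleted."
tags = extract_locus_tags(sample, ["BPSL", "BPSS"])
print(tags)
```

Pairs are expanded only when both tags share a prefix, since a region spanning two prefixes (for example, two chromosomes) has no obvious numeric interpretation under these assumptions.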