
Table 3 Comparison of text-based and sequence-based retrieval methods for image data for an arbitrary set of genes

From: Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data

    

Numbers of image sets retrieved for target gene using different methods. The first four columns describe the test gene; the "sequence" and "text" columns give the number of image sets retrieved by each method.

| frog species | current symbol | current full name | mRNA accession or other identifier | sequence: with full-length mRNA | text: with current gene symbol | text: with trial and error text terms | notes |
|---|---|---|---|---|---|---|---|
| X. laevis | chrd | chordin | NM_001088309 | 3 | 0 | 3 | |
| X. laevis | hes1 | hairy and enhancer of split 1 | NM_001085917 | 1 | 1 | 1 | |
| X. laevis | nog-A | noggin | NM_001085644 | 1 | 0 | 1 | |
| X. laevis | Six1 | homeobox protein SIX1 | NM_001088558 | 1 | 1 | 1 | |
| X. tropicalis | bambi | BMP and activin membrane-bound inhibitor | NM_001008193 | 2 | 2 | 2 | |
| X. tropicalis | bmp4 | bone morphogenetic protein 4 | Xt7.1-XZT65619.5.5 | 3 | 1 | 3 | mRNA from Entrez Gene appears to be truncated, used EST-based contig sequence instead |
| X. tropicalis | fgf8 | fibroblast growth factor 8 | NM_001008162 | 1 | 0 | 1 | |
| X. tropicalis | lhx1 | LIM homeobox 1 | NM_001100228 | 2 | 0 | 2 | |
| X. tropicalis | smarcd1 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily d, member 1 | NM_001004862 | 1 | 0 | 0 | probe design sequences were in 3'UTR so there were no BLAST hits for text identification |
| X. tropicalis | sox2 | SRY (sex determining region Y)-box 2 | NM_213704 | 4 | 0 | 4 | alias gene symbol 'sox-2' worked better than 'sox2' |
| X. tropicalis | t | T, brachyury homolog | NM_001008138 | 6 | !! | 6 | a large number of protein descriptions contain the letter 't' |
| X. tropicalis | tp53 | tumor protein p53 | NM_001001903 | 2 | 0 | 2 | older alias gene symbol 'p53' retrieved both image sets |

  1. Image sets are defined by their associated sequence and source collection. Each associated sequence was searched with BLAST against the NCBI protein databases, retaining the best match for human, mouse and the two frog species. Text-based retrieval used simple text matching (allowing wild cards) against the protein description returned by BLAST. Sequence-based retrieval used BLASTn against a database of the image-associated sequences. For each gene, the number of image sets retrieved by the sequence method, using the full-length mRNA, was noted. Text searches with various combinations of the gene symbol, exact full name, and more commonly used names confirmed that the sequence method appeared to have retrieved all image sets for the target gene in each case. Where images for other genes were retrieved along with the target gene images, search results were disambiguated by inspection, on percent identity or protein description as appropriate.
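
The text-based method described in the footnote can be illustrated with a minimal Python sketch. It assumes wildcard matching behaves like shell-style globbing (via the standard-library `fnmatch` module), which may differ from the matching actually used in the study. The records in `IMAGE_SETS`, the set ids and the function name `text_retrieve` are all hypothetical and invented for illustration; they are not data from the table.

```python
from fnmatch import fnmatchcase

# Hypothetical image-set records: (image set id, best-hit protein description).
# The descriptions are invented for illustration, not taken from the study.
IMAGE_SETS = [
    ("set1", "SRY (sex determining region Y)-box 2 (sox-2)"),
    ("set2", "tumor protein p53"),
    ("set3", "T, brachyury homolog"),
    ("set4", "chordin"),
]


def text_retrieve(pattern, image_sets=IMAGE_SETS):
    """Return ids of image sets whose description matches a wildcard pattern.

    Matching is case-insensitive: both sides are lower-cased before
    shell-style wildcard comparison.
    """
    pat = pattern.lower()
    return [set_id for set_id, desc in image_sets
            if fnmatchcase(desc.lower(), pat)]


# The current symbol can miss a set that an alias finds.
print(text_retrieve("*sox2*"))
print(text_retrieve("*sox-2*"))

# A one-letter symbol matches almost every description.
print(text_retrieve("*t*"))
```

With these made-up records, the pattern `*sox2*` finds nothing while `*sox-2*` finds the sox2 set, and `*t*` matches every description containing the letter 't'. This mirrors the kind of behaviour recorded in the notes column for sox2, tp53 and t, and suggests why the text method needed trial-and-error terms where the sequence method did not.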