Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data

BMC Bioinformatics

Table 3. Comparison of text-based and sequence-based retrieval of image data for an arbitrary set of genes

frog species	test gene: current symbol	test gene: current full name	test gene: mRNA accession or other identifier	image sets retrieved by sequence (full-length mRNA)	image sets retrieved by text (current gene symbol)	image sets retrieved by text (trial-and-error text terms)	notes
X. laevis	chrd	chordin	NM_001088309	3	0	3
	hes1	hairy and enhancer of split 1	NM_001085917	1	1	1
	nog-A	noggin	NM_001085644	1	0	1
	Six1	homeobox protein SIX1	NM_001088558	1	1	1
X. tropicalis	bambi	BMP and activin membrane-bound inhibitor	NM_001008193	2	2	2
	bmp4	bone morphogenetic protein 4	Xt7.1-XZT65619.5.5	3	1	3	mRNA from Entrez Gene appears to be truncated; the EST-based contig sequence was used instead
	fgf8	fibroblast growth factor 8	NM_001008162	1	0	1
	lhx1	LIM homeobox 1	NM_001100228	2	0	2
	smarcd1	SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily d, member 1	NM_001004862	1	0	0	probe design sequences were in 3'UTR so there were no BLAST hits for text identification
	sox2	SRY (sex determining region Y)-box 2	NM_213704	4	0	4	alias gene symbol 'sox-2' worked better than 'sox2'
	t	T, brachyury homolog	NM_001008138	6	!!	6	a large number of protein descriptions contain the letter 't'
	tp53	tumor protein p53	NM_001001903	2	0	2	older alias gene symbol 'p53' retrieved both image sets

Image sets are defined by their associated sequence and source collection. Each associated sequence was searched with BLAST against the NCBI protein databases, and the best match was retained for human, mouse and each of the two frog species. Text-based retrieval used simple text matching (allowing wildcards) against the protein descriptions returned by BLAST. Sequence-based retrieval used BLASTn against a database of the image-associated sequences. For each gene, the number of image sets retrieved by the sequence method using the full-length mRNA was noted. Text searches with various combinations of the gene symbol, the exact full name and more commonly used names confirmed that the sequence method appeared to have retrieved all image sets for the target gene in each case. Where images for other genes were retrieved along with the target gene images, the results were disambiguated by inspection, on percent identity or protein description as appropriate.
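
The text-based retrieval described in the legend can be illustrated with a minimal Python sketch. It is not the authors' implementation: the image set identifiers, the example descriptions and the text_search helper are hypothetical, and the sketch assumes that matching is case-insensitive and treats the search term as a substring pattern with wildcards. It shows why a current gene symbol can miss image sets that an alias retrieves, and why a one-letter symbol such as 't' is too ambiguous to be useful.

```python
from fnmatch import fnmatchcase

# Hypothetical image sets: (identifier, best-hit protein description).
# These records are invented for illustration only.
IMAGE_SETS = [
    ("set-1", "chordin"),
    ("set-2", "sex determining region Y-box 2 (sox-2)"),
    ("set-3", "T, brachyury homolog"),
    ("set-4", "tumor protein p53"),
    ("set-5", "noggin"),
]


def text_search(term, records):
    """Return identifiers whose description contains the term.

    Assumption: matching is case-insensitive, and the term may itself
    contain '*' or '?' wildcards; it is wrapped in '*' on both sides so
    that it can match anywhere in the description.
    """
    pattern = f"*{term.lower()}*"
    return [rid for rid, desc in records if fnmatchcase(desc.lower(), pattern)]


if __name__ == "__main__":
    for term in ("sox2", "sox-2", "sox*2", "tp53", "p53", "t"):
        print(term, text_search(term, IMAGE_SETS))
```

In this toy data set, 'sox2' and 'tp53' retrieve nothing while 'sox-2' and 'p53' each retrieve one image set, and 't' matches every description that contains the letter, which mirrors the behaviour recorded in the notes column.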

ISSN: 1471-2105