ParsEval: parallel comparison and analysis of gene structure annotations

BMC Bioinformatics

Table 4 Annotation comparison methods

	*A. thaliana*		*D. melanogaster*		*G. max*		*H. sapiens*
Reference annotations	TAIR9		FlyBase 5.39		NCBI Entrez		UCSC knownGene (hg19)
Prediction annotations	TAIR10		Ensembl r65		JGI / Phytozome		Ensembl r65
Average runtime (sec)	Text	HTML	Text	HTML	Text	HTML	Text	HTML
n=1	36.3	859.4	91.1	1,350.5	85.3	1,461.1	294.3	6,422.0
n=2	32.8	449.2	56.6	859.5	79.4	768.4	181.3	4,089.5
n=4	30.7	246.5	39.2	633.7	76.5	439.9	130.1	2,751.2
n=8	29.8	168.7	32.4	546.6	76.3	330.5	108.0	2,323.3
Gene loci	25,618		10,976		47,877		17,865
shared	25,590		10,944		37,942		7,779
unique to reference	6		32		3,363		9,569
unique to prediction	22		0		6,572		517
Comparisons	33,002		22,474		38,734		16,168
perfect matches	31,750	96.2%	22,446	99.9%	2,489	6.4%	2,517	15.6%
CDS structure matches	420	1.3%	0	0.0%	17,450	45.1%	8,269	51.1%
exon structure matches	8	0.0%	21	0.1%	26	0.1%	27	0.2%
UTR structure matches	159	0.5%	1	0.0%	647	1.7%	58	0.4%
non-matches	665	2.0%	6	0.0%	18,122	46.8%	5,297	32.8%

To demonstrate ParsEval’s speed and scalability, we obtained pairs of whole-genome annotations for Arabidopsis thaliana (thale cress), Drosophila melanogaster (fruit fly), Glycine max (soybean), and Homo sapiens (human). For each organism, we used ParsEval to compare the two corresponding sets of annotations. Average runtimes in seconds are shown for both text and HTML/PNG output modes, using n = 1, 2, 4, and 8 processors. For each organism, we also show the number of gene loci identified, how many were shared between the two sets of annotations, and how many were unique to one set. Finally, we show the number of reported comparisons for each organism and how many were perfect gene structure matches, how many were CDS structure matches, and how many were non-matches. All of the results shown in this table were obtained directly from the summary reports generated by ParsEval.
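The scaling behaviour in the runtime rows can be made explicit with a short calculation. The sketch below is not part of ParsEval. It copies the average runtimes from the table and applies the standard definitions of parallel speedup (runtime on 1 processor divided by runtime on n processors) and parallel efficiency (speedup divided by n). The names `RUNTIMES`, `speedup`, `efficiency`, and `summarize` are hypothetical and chosen only for illustration.

```python
# Average runtimes in seconds, copied from Table 4.
# Outer key: (organism, output mode); inner key: number of processors n.
RUNTIMES = {
    ("A. thaliana", "Text"): {1: 36.3, 2: 32.8, 4: 30.7, 8: 29.8},
    ("A. thaliana", "HTML"): {1: 859.4, 2: 449.2, 4: 246.5, 8: 168.7},
    ("D. melanogaster", "Text"): {1: 91.1, 2: 56.6, 4: 39.2, 8: 32.4},
    ("D. melanogaster", "HTML"): {1: 1350.5, 2: 859.5, 4: 633.7, 8: 546.6},
    ("G. max", "Text"): {1: 85.3, 2: 79.4, 4: 76.5, 8: 76.3},
    ("G. max", "HTML"): {1: 1461.1, 2: 768.4, 4: 439.9, 8: 330.5},
    ("H. sapiens", "Text"): {1: 294.3, 2: 181.3, 4: 130.1, 8: 108.0},
    ("H. sapiens", "HTML"): {1: 6422.0, 2: 4089.5, 4: 2751.2, 8: 2323.3},
}


def speedup(times, n):
    """Speedup on n processors relative to the single-processor runtime."""
    return times[1] / times[n]


def efficiency(times, n):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup(times, n) / n


def summarize(runtimes=RUNTIMES, n=8):
    """Map each (organism, mode) pair to its (speedup, efficiency) at n processors."""
    return {
        key: (speedup(times, n), efficiency(times, n))
        for key, times in runtimes.items()
    }


if __name__ == "__main__":
    for (organism, mode), (s, e) in summarize().items():
        print(f"{organism:16s} {mode:4s} speedup={s:.2f} efficiency={e:.2f}")
```

Under these definitions, the A. thaliana HTML run reaches a speedup of roughly 5.1 on 8 processors, while the corresponding text-mode run reaches only about 1.2, which suggests that the HTML/PNG output mode benefits far more from additional processors than text mode does.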

ISSN: 1471-2105