Semantic annotation of biological concepts interplaying microbial cellular responses

BMC Bioinformatics

Table 4 General statistics about the corpus of full-text documents

Category	Subcategory	# concepts	# annotations	% concepts	% annotations	Annotation frequency	Concept distribution (%)
Genetic Information Carrier	dna	126	3771	3.45%	6.39%	29.93	8.87
	rna	119	3970	3.26%	6.72%	33.36	8.38
	gene	1175	8770	32.20%	14.85%	7.46	82.75
Protein	protein	175	2332	4.80%	3.95%	13.33	28.69
	enzyme	388	4025	10.63%	6.82%	10.37	63.61
	transcription factor	47	1434	1.29%	2.43%	30.51	7.70
compound		767	21414	21.02%	36.27%	27.92
physiological state		403	10166	11.04%	17.22%	25.23
laboratory technique		449	3161	12.30%	5.35%	7.04
Total		3649	59043	100%	100%

The first statistics report the number and percentage of biological concepts and associated annotations, and the frequency of annotations per category. Besides individual categories, there are hierarchically structured annotation categories: the categories dna, rna and gene belong to the supercategory genetic information carrier, and the categories protein, enzyme and transcription factor are subcategories of protein. For these categories, the concept distribution is calculated by dividing the number of biological concepts assigned to the category by the total number of biological concepts assigned to its supercategory.
Legend: The symbol "#" stands for "number of" and the symbol "%" stands for "percentage of". Frequencies are calculated as follows: $\text{concept distribution} = \frac{\text{number of biological concepts in category}}{\text{total number of biological concepts in supercategory}}$ and $\text{annotation frequency} = \frac{\text{number of annotations in category}}{\text{number of biological concepts in category}}$
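
The two ratios in the legend can be checked against the table rows that have a supercategory. The following is a minimal Python sketch, not code from the paper: the names `TABLE4`, `annotation_frequency`, `concept_distribution` and `summarize` are illustrative. It assumes that the concept distribution column is reported as a percentage (hence the factor of 100) and that values are rounded to two decimals, which appears to match the table.

```python
# Counts copied from Table 4: (number of concepts, number of annotations).
# The dictionary layout and all function names are illustrative assumptions.
TABLE4 = {
    "genetic information carrier": {
        "dna": (126, 3771),
        "rna": (119, 3970),
        "gene": (1175, 8770),
    },
    "protein": {
        "protein": (175, 2332),
        "enzyme": (388, 4025),
        "transcription factor": (47, 1434),
    },
}


def annotation_frequency(n_annotations, n_concepts):
    """Annotations per biological concept within one category."""
    return n_annotations / n_concepts


def concept_distribution(n_concepts, n_supercategory_concepts):
    """Share of a supercategory's concepts held by one category, in percent.

    The factor of 100 is an assumption made to match the table's scale.
    """
    return 100.0 * n_concepts / n_supercategory_concepts


def summarize(table):
    """Return {category: (annotation frequency, concept distribution)}."""
    rows = {}
    for categories in table.values():
        # Total number of concepts across the supercategory's categories.
        total = sum(n_concepts for n_concepts, _ in categories.values())
        for name, (n_concepts, n_annotations) in categories.items():
            rows[name] = (
                round(annotation_frequency(n_annotations, n_concepts), 2),
                round(concept_distribution(n_concepts, total), 2),
            )
    return rows


if __name__ == "__main__":
    for name, (freq, dist) in summarize(TABLE4).items():
        print(f"{name}: annotation frequency={freq}, concept distribution={dist}%")
```

For example, dna has 126 of the 1420 concepts in its supercategory (126 + 119 + 1175), so its concept distribution is about 8.87%, and its 3771 annotations over 126 concepts give an annotation frequency of about 29.93.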

ISSN: 1471-2105