Table 4 General statistics about the corpus of full-text documents

From: Semantic annotation of biological concepts interplaying microbial cellular responses

| Categories | | # concepts | # annotations | % concepts | % annotations | Annotation Frequency | Concept Distribution (%) |
|---|---|---|---|---|---|---|---|
| Genetic Information Carrier | dna | 126 | 3771 | 3.45% | 6.39% | 29.93 | 8.87 |
| | rna | 119 | 3970 | 3.26% | 6.72% | 33.36 | 8.38 |
| | gene | 1175 | 8770 | 32.20% | 14.85% | 7.46 | 82.75 |
| Protein | protein | 175 | 2332 | 4.80% | 3.95% | 13.33 | 28.69 |
| | enzyme | 388 | 4025 | 10.63% | 6.82% | 10.37 | 63.61 |
| | transcription factor | 47 | 1434 | 1.29% | 2.43% | 30.51 | 7.70 |
| compound | | 767 | 21414 | 21.02% | 36.27% | 27.92 | |
| physiological state | | 403 | 10166 | 11.04% | 17.22% | 25.23 | |
| laboratory technique | | 449 | 3161 | 12.30% | 5.35% | 7.04 | |
| Total | | 3649 | 59043 | 100% | 100% | | |
  1. The first statistics report the number and percentage of biological concepts and associated annotations, and the frequency of annotations per category. Besides the individual categories, some annotation categories are hierarchically structured: the categories dna, rna and gene belong to the supercategory genetic information carrier, and the categories protein, enzyme and transcription factor are subcategories of protein. For these categories, the concept distribution is calculated by dividing the number of biological concepts assigned to the category by the total number of biological concepts assigned to its supercategory.
  2. Legend: The symbol "#" stands for "number of" and the symbol "%" stands for "percentage of". Frequencies are calculated as follows:

  \[
  \text{concept distribution} = \frac{\text{number of biological concepts in category}}{\text{total number of biological concepts in supercategory}}
  \]

  \[
  \text{annotation frequency} = \frac{\text{number of annotations in category}}{\text{number of biological concepts in category}}
  \]
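
The two ratios defined in the legend can be checked against the table with a short script. This is only an illustrative sketch, not the authors' code: the dictionary layout and the function names `annotation_frequency` and `concept_distribution` are assumptions made here, and the counts are transcribed from the table. The concept distribution is multiplied by 100 on the assumption that the table reports it as a percentage, which matches the printed values.

```python
# Counts transcribed from Table 4.
# Layout (an assumption of this sketch): category -> (supercategory or None, concepts, annotations)
COUNTS = {
    "dna": ("genetic information carrier", 126, 3771),
    "rna": ("genetic information carrier", 119, 3970),
    "gene": ("genetic information carrier", 1175, 8770),
    "protein": ("protein", 175, 2332),
    "enzyme": ("protein", 388, 4025),
    "transcription factor": ("protein", 47, 1434),
    "compound": (None, 767, 21414),
    "physiological state": (None, 403, 10166),
    "laboratory technique": (None, 449, 3161),
}


def annotation_frequency(category):
    """Number of annotations in the category divided by its number of concepts."""
    _, concepts, annotations = COUNTS[category]
    return annotations / concepts


def concept_distribution(category):
    """Share (in percent) of the supercategory's concepts that fall in this category.

    Returns None for categories that have no supercategory.
    """
    parent, concepts, _ = COUNTS[category]
    if parent is None:
        return None
    supercategory_total = sum(c for p, c, _ in COUNTS.values() if p == parent)
    return 100 * concepts / supercategory_total


if __name__ == "__main__":
    for name in COUNTS:
        freq = annotation_frequency(name)
        dist = concept_distribution(name)
        dist_text = "" if dist is None else f"{dist:.2f}"
        print(f"{name:22s} {freq:6.2f} {dist_text:>6s}")
```

Rounded to two decimals, these ratios agree with the Annotation Frequency and Concept Distribution columns, and the per-category counts add up to the totals in the last row.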