
Table 4 General statistics about the corpus of full-text documents

From: Semantic annotation of biological concepts interplaying microbial cellular responses

| Supercategory | Category | # concepts | # annotations | % concepts | % annotations | Annotation frequency | Concept distribution (%) |
|---|---|---|---|---|---|---|---|
| Genetic Information Carrier | dna | 126 | 3771 | 3.45% | 6.39% | 29.93 | 8.87 |
| | rna | 119 | 3970 | 3.26% | 6.72% | 33.36 | 8.38 |
| | gene | 1175 | 8770 | 32.20% | 14.85% | 7.46 | 82.75 |
| Protein | protein | 175 | 2332 | 4.80% | 3.95% | 13.33 | 28.69 |
| | enzyme | 388 | 4025 | 10.63% | 6.82% | 10.37 | 63.61 |
| | transcription factor | 47 | 1434 | 1.29% | 2.43% | 30.51 | 7.70 |
| | compound | 767 | 21414 | 21.02% | 36.27% | 27.92 | |
| | physiological state | 403 | 10166 | 11.04% | 17.22% | 25.23 | |
| | laboratory technique | 449 | 3161 | 12.30% | 5.35% | 7.04 | |
| Total | | 3649 | 59043 | 100% | 100% | | |
  
  1. The table reports the number and percentage of biological concepts and associated annotations, and the frequency of annotations per category. Besides individual categories, there are hierarchically structured annotation categories: dna, rna and gene belong to the supercategory genetic information carrier, and protein, enzyme and transcription factor are subcategories of protein. For these categories, the concept distribution is calculated by dividing the number of biological concepts assigned to the category by the total number of biological concepts assigned to its supercategory.
  2. Legend: The symbol "#" stands for "number of" and the symbol "%" stands for "percentage of". Frequencies are calculated as follows:

     $$\text{concept distribution} = \frac{\text{number of biological concepts in category}}{\text{total number of biological concepts in supercategory}}$$

     $$\text{annotation frequency} = \frac{\text{number of annotations in category}}{\text{number of biological concepts in category}}$$
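
The two formulas in the legend can be checked against the table with a short sketch. The counts below are copied from Table 4. The dictionary layout and the function names are illustrative, not part of the original work. The sketch also assumes that the concept distribution column is the ratio expressed as a percentage, which is what the reported values suggest.

```python
# Counts copied from Table 4: category -> (supercategory, #concepts, #annotations).
# Only the hierarchically structured categories are listed here.
TABLE_4 = {
    "dna": ("genetic information carrier", 126, 3771),
    "rna": ("genetic information carrier", 119, 3970),
    "gene": ("genetic information carrier", 1175, 8770),
    "protein": ("protein", 175, 2332),
    "enzyme": ("protein", 388, 4025),
    "transcription factor": ("protein", 47, 1434),
}


def annotation_frequency(category):
    """Number of annotations per biological concept in the category."""
    _, concepts, annotations = TABLE_4[category]
    return annotations / concepts


def concept_distribution(category):
    """Share of the supercategory's concepts in this category, in percent.

    Assumption: the table reports the legend's ratio multiplied by 100.
    """
    supercategory, concepts, _ = TABLE_4[category]
    total = sum(c for s, c, _ in TABLE_4.values() if s == supercategory)
    return 100 * concepts / total


if __name__ == "__main__":
    for name in TABLE_4:
        print(
            f"{name}: {annotation_frequency(name):.2f} annotations per concept, "
            f"{concept_distribution(name):.2f}% of supercategory concepts"
        )
```

For example, gene accounts for 1175 of the 1420 concepts in its supercategory, and dna has 3771 annotations spread over 126 concepts.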