Table 8 Using the BioLexicon for fact extraction

From: The BioLexicon: a large-scale terminological resource for biomedical text mining

| Year | Total lexical verbs | Verbs in BL (% of lexical verbs) | Facts extracted (% of BL verbs) | Grammatical frame mismatch (% of BL verbs) | Absence of NE in arguments (% of BL verbs) | Facts with prepositional arguments (% of facts) |
|---|---|---|---|---|---|---|
| 2001 | 6,637,052 | 4,083,325 (61.5%) | 1,000,571 (24.5%) | 89,719 (2.2%) | 2,993,038 (73.3%) | 187,493 (18.7%) |
| 2002 | 13,412,793 | 8,694,065 (64.8%) | 2,417,809 (27.8%) | 194,986 (2.2%) | 6,081,289 (69.9%) | 493,962 (20.4%) |
| 2004 | 6,811,428 | 4,201,550 (61.7%) | 742,621 (17.7%) | 89,249 (2.1%) | 3,369,690 (80.2%) | 129,600 (17.5%) |

  1. The table gives statistics on the use of the BioLexicon (BL) to identify biomedical facts in approximately 80,000 full-text articles taken from the UKPMC corpus. For each of the 3 years represented in this sub-corpus, the total number of lexical (i.e., non-auxiliary) verbs is shown. The next column gives the number of these lexical verbs that match entries in the BL, also shown as a percentage of the total number of lexical verbs. Next is the number of verbs extracted as representing facts (i.e., those remaining after the filtering steps), given both as an absolute figure and as a percentage of the verbs that match BL entries. The following two columns give the number of verbs removed by each of the two filtering steps, again with percentages of the verbs that match BL entries: first, verbs whose grammatical frame does not match those specified in the BL, and second, verbs whose arguments do not contain any named entities (NEs). The final column gives the number of extracted facts that have one or more arguments corresponding to prepositional phrases, both as a raw figure and as a percentage of the total number of facts extracted.
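The percentages described in this note can be recomputed from the raw counts. The following is a minimal sketch in Python, not part of the original study: the counts are copied from Table 8, while the dictionary layout, the key names, and the helper functions `pct` and `percentages` are illustrative assumptions. Each percentage uses the denominator named in the corresponding column header.

```python
# Raw counts copied from Table 8. The key names are hypothetical labels
# chosen for this sketch; they do not come from the BioLexicon or the paper.
TABLE_8 = {
    2001: {"lexical": 6_637_052, "in_bl": 4_083_325, "facts": 1_000_571,
           "frame_mismatch": 89_719, "no_ne": 2_993_038, "prep_facts": 187_493},
    2002: {"lexical": 13_412_793, "in_bl": 8_694_065, "facts": 2_417_809,
           "frame_mismatch": 194_986, "no_ne": 6_081_289, "prep_facts": 493_962},
    2004: {"lexical": 6_811_428, "in_bl": 4_201_550, "facts": 742_621,
           "frame_mismatch": 89_249, "no_ne": 3_369_690, "prep_facts": 129_600},
}


def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def percentages(row):
    """Recompute the bracketed percentages for one year of Table 8."""
    return {
        # Verbs matching BL entries, as a share of all lexical verbs.
        "in_bl": pct(row["in_bl"], row["lexical"]),
        # The next three use the verbs matching BL entries as denominator.
        "facts": pct(row["facts"], row["in_bl"]),
        "frame_mismatch": pct(row["frame_mismatch"], row["in_bl"]),
        "no_ne": pct(row["no_ne"], row["in_bl"]),
        # Facts with prepositional arguments, as a share of extracted facts.
        "prep_facts": pct(row["prep_facts"], row["facts"]),
    }


for year, row in TABLE_8.items():
    print(year, percentages(row))
```

Under these assumptions, the recomputed values agree with the bracketed percentages in the table for all three years, which confirms which count serves as the denominator in each column.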