
Table 4 Comparison of extracted collocations

From: Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

| Corpus name | Corpus type | Corpus size (# words / # distinct words) | # extracted collocations (TMI / PMI) | Average # documents per collocation (TMI / PMI) |
|---|---|---|---|---|
| WSJ-400 | Non-redundant | 214 K / 19 K | 551 / 565 | 20.2 / 19.9 |
| WSJ-600 | Non-redundant | 309 K / 23.5 K | 943 / 1,000 | 15.5 / 15.2 |
| WSJ-1300 | Non-redundant | 680 K / 36 K | 1,881 / 2,518 | 10.8 / 9.7 |
| WSJs5 | Synthetic redundant | 1.69 M (±42 K) / 36 K | 3,035 (±63) / 17,015 (±950) | 7.4 (±0.11) / 2.8 (±0.09) |

Comparison of extracted collocations on synthetic redundant corpora and non-redundant corpora (WSJ - X words / Y distinct words). Collocations were extracted using True Mutual Information (TMI) and Pointwise Mutual Information (PMI), with cutoffs of 0.001 and 0.01 respectively.
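
The caption names the two association measures but does not spell out their formulas. The sketch below illustrates one common formulation, assuming collocations are adjacent word pairs, PMI is log2(p(x,y) / (p(x) p(y))), and TMI is that same quantity weighted by p(x,y). The function names `bigram_scores` and `extract_collocations` are hypothetical, and the choices of log base, probability estimates, and strict "greater than" thresholding are assumptions rather than details confirmed by the source.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (pmi, tmi)} for every adjacent word pair.

    Assumed definitions (not taken from the source):
      PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
      TMI(x, y) = p(x, y) * PMI(x, y)
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi            # relative frequency of the pair
        p_x = unigrams[w1] / n_uni     # relative frequency of each word
        p_y = unigrams[w2] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        tmi = p_xy * pmi
        scores[(w1, w2)] = (pmi, tmi)
    return scores


def extract_collocations(tokens, tmi_cutoff=0.001, pmi_cutoff=0.01):
    """Return two sets of pairs: those above the TMI cutoff and those
    above the PMI cutoff. Defaults mirror the cutoffs in the caption."""
    scores = bigram_scores(tokens)
    by_tmi = {pair for pair, (pmi, tmi) in scores.items() if tmi > tmi_cutoff}
    by_pmi = {pair for pair, (pmi, tmi) in scores.items() if pmi > pmi_cutoff}
    return by_tmi, by_pmi


tokens = "blood pressure is high and blood pressure is low".split()
by_tmi, by_pmi = extract_collocations(tokens)
print(("blood", "pressure") in by_tmi, ("blood", "pressure") in by_pmi)
```

Under these definitions PMI ignores how often a pair occurs, so rare pairs can score highly, whereas TMI scales the score by the pair's probability. Real pipelines typically also apply minimum-frequency filters, which this sketch omits.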