Table 4 Comparison of extracted collocations

From: Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

| Corpus name | Corpus type | Corpus size (# words / # distinct words) | # extracted collocations (TMI / PMI) | Average # documents per collocation (TMI / PMI) |
|---|---|---|---|---|
| WSJ-400 | Non-redundant | 214 K / 19 K | 551 / 565 | 20.2 / 19.9 |
| WSJ-600 | Non-redundant | 309 K / 23.5 K | 943 / 1,000 | 15.5 / 15.2 |
| WSJ-1300 | Non-redundant | 680 K / 36 K | 1,881 / 2,518 | 10.8 / 9.7 |
| WSJs5 | Synthetic redundant | 1.69 M (±42 K) / 36 K | 3,035 (±63) / 17,015 (±950) | 7.4 (±0.11) / 2.8 (±0.09) |
  1. Comparison of extracted collocations on synthetic redundant corpora and non-redundant corpora (for each WSJ corpus, size is given as X words / Y distinct words). Collocations were extracted using True Mutual Information (TMI) and Pointwise Mutual Information (PMI), with cutoffs of 0.001 and 0.01 respectively.
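
To make the two scores concrete, here is a minimal sketch of how bigram collocations might be scored and filtered with the cutoffs quoted above. It rests on assumptions not stated in the table: PMI is taken as log2(p(x,y) / (p(x) p(y))), TMI as that value weighted by p(x,y), probabilities are plain relative frequencies over adjacent word pairs, and the function names `bigram_scores` and `extract_collocations` are hypothetical. The article's actual tokenization, log base, and estimation details may differ.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {bigram: (tmi, pmi)} for adjacent word pairs in a token list.

    Assumed definitions (not confirmed by the source):
      PMI(x, y) = log2(p(x, y) / (p(x) * p(y)))
      TMI(x, y) = p(x, y) * PMI(x, y)
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        tmi = p_xy * pmi
        scores[(x, y)] = (tmi, pmi)
    return scores


def extract_collocations(tokens, tmi_cutoff=0.001, pmi_cutoff=0.01):
    """Return two sets of bigrams: those above the TMI cutoff and those above the PMI cutoff."""
    scores = bigram_scores(tokens)
    by_tmi = {bg for bg, (tmi, _) in scores.items() if tmi > tmi_cutoff}
    by_pmi = {bg for bg, (_, pmi) in scores.items() if pmi > pmi_cutoff}
    return by_tmi, by_pmi


if __name__ == "__main__":
    sample = "new york new york a b".split()
    tmi_set, pmi_set = extract_collocations(sample)
    print(sorted(tmi_set))
    print(sorted(pmi_set))
```

Because TMI multiplies by the joint probability, it favors frequent pairs, whereas PMI alone can rank rare pairs highly. This is one plausible reading of why the PMI count in the table changes far more than the TMI count on the redundant corpus.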