Table 2 Collocations found in redundant and non-redundant corpora

From: Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

Measure                                                    All informative (redundant)  Last informative (non-redundant)
Word Types                                                 81,928                       40,774
Words                                                      3,641,031                    545,231
Collocations                                               15,814                       2,527
Collocations/Word                                          0.004                        0.004
Avg. number of patients per collocation                    18.2                         66
% collocations that appear in notes of 3 patients or less  36%                          1%
  1. Collocations were extracted using a stringent pointwise mutual information (PMI) cutoff of 0.001.
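
For readers unfamiliar with PMI-based collocation extraction, the following is a minimal sketch of how such a cutoff could be applied. It is not the authors' pipeline: the function name `pmi_collocations`, the restriction to adjacent word pairs, the base-2 logarithm, and the maximum-likelihood probability estimates are all assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_collocations(tokens, cutoff=0.001):
    """Return {(w1, w2): pmi} for adjacent word pairs whose PMI exceeds `cutoff`.

    Assumed definition: PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with
    probabilities estimated from raw counts. The source table does not state
    the estimator or the tool that was used.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    collocations = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_unigrams
        p_y = unigrams[w2] / n_unigrams
        pmi = math.log2(p_xy / (p_x * p_y))
        # Keep only pairs that co-occur more often than chance predicts.
        if pmi > cutoff:
            collocations[(w1, w2)] = pmi
    return collocations


# Toy corpus: "a a" and "b b" co-occur more often than chance, "a b" less often.
toy_tokens = "a a a a b b b b".split()
toy_result = pmi_collocations(toy_tokens)
```

In a redundant corpus, text copied between notes of the same patient inflates bigram counts, so pairs specific to a handful of patients can clear the cutoff. That is consistent with the table's 36% of collocations appearing in notes of 3 patients or less in the redundant corpus, against 1% in the non-redundant one.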