Table 5 Descriptive statistics of the patient notes corpora

From: Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

| Corpus | # Notes | # Words | # Concepts |
| --- | --- | --- | --- |
| All Informative (input) | 8,557 | 6,131,879 | 599,847 |
| Last Informative Note (baseline) | 1,247 | 435,387 | 44,145 |
| Selective-Fingerprinting maximum similarity 0.33 | 4,524 | 3,614,409 | 337,034 |
| Selective-Fingerprinting maximum similarity 0.25 | 3,970 | 3,283,558 | 302,159 |
| Selective-Fingerprinting maximum similarity 0.20 | 3,645 | 3,061,854 | 278,644 |
  1. The All Informative input corpus, the corpus obtained by the redundancy reduction baseline (Last Informative Note), and the corpora produced by the fingerprinting redundancy reduction strategy at different maximum-similarity thresholds.
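
To make the size of each reduction easier to read, the counts can be expressed as fractions of the input corpus. The following is a minimal sketch rather than anything taken from the article: the counts are copied from Table 5, while the names `CORPORA` and `retained_fractions` are hypothetical and introduced only for illustration.

```python
# Counts copied from Table 5: (notes, words, concepts) per corpus.
# CORPORA and retained_fractions are illustrative names, not from the article.
CORPORA = {
    "All Informative (input)": (8557, 6131879, 599847),
    "Last Informative Note (baseline)": (1247, 435387, 44145),
    "Selective-Fingerprinting maximum similarity 0.33": (4524, 3614409, 337034),
    "Selective-Fingerprinting maximum similarity 0.25": (3970, 3283558, 302159),
    "Selective-Fingerprinting maximum similarity 0.20": (3645, 3061854, 278644),
}


def retained_fractions(corpora, reference="All Informative (input)"):
    """Return, for each corpus, its counts divided by the reference corpus counts."""
    ref = corpora[reference]
    return {
        name: tuple(count / ref_count for count, ref_count in zip(counts, ref))
        for name, counts in corpora.items()
    }


# Print the share of notes, words, and concepts each corpus keeps.
for name, (notes, words, concepts) in retained_fractions(CORPORA).items():
    print(f"{name}: notes {notes:.1%}, words {words:.1%}, concepts {concepts:.1%}")
```

Read this way, a lower maximum-similarity threshold keeps a smaller share of the input, and the baseline keeps the smallest share of all.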