Table 5 Descriptive statistics of the patient notes corpora

From: Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

| Corpus | # Notes | # Words | # Concepts |
|---|---|---|---|
| All Informative (input) | 8,557 | 6,131,879 | 599,847 |
| Last Informative Note (baseline) | 1,247 | 435,387 | 44,145 |
| Selective-Fingerprinting, maximum similarity 0.33 | 4,524 | 3,614,409 | 337,034 |
| Selective-Fingerprinting, maximum similarity 0.25 | 3,970 | 3,283,558 | 302,159 |
| Selective-Fingerprinting, maximum similarity 0.20 | 3,645 | 3,061,854 | 278,644 |

  1. All Informative is the input corpus; Last Informative Note is the corpus obtained by the redundancy reduction baseline; the Selective-Fingerprinting corpora are those produced by the fingerprinting redundancy reduction strategy at different maximum-similarity thresholds.
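
To make the relative sizes of the corpora easier to compare, the counts above can be expressed as fractions of the All Informative input corpus. The following is a minimal sketch in Python; the dictionary `TABLE5` and the helper `retained_fractions` are illustrative names, not part of the original study, and the fractions it computes are simple arithmetic on the table's counts rather than figures reported by the authors.

```python
# Counts copied from Table 5: (notes, words, concepts) for each corpus.
TABLE5 = {
    "All Informative (input)": (8557, 6131879, 599847),
    "Last Informative Note (baseline)": (1247, 435387, 44145),
    "Selective-Fingerprinting 0.33": (4524, 3614409, 337034),
    "Selective-Fingerprinting 0.25": (3970, 3283558, 302159),
    "Selective-Fingerprinting 0.20": (3645, 3061854, 278644),
}


def retained_fractions(table, reference="All Informative (input)"):
    """Return, per corpus, the fraction of notes, words and concepts
    kept relative to the reference corpus (hypothetical helper)."""
    ref = table[reference]
    return {
        name: tuple(count / ref_count for count, ref_count in zip(counts, ref))
        for name, counts in table.items()
    }


if __name__ == "__main__":
    for name, (notes, words, concepts) in retained_fractions(TABLE5).items():
        print(f"{name:34s} notes={notes:.1%}  words={words:.1%}  concepts={concepts:.1%}")
```

Read this way, a lower maximum-similarity threshold keeps a smaller share of the input notes, while the Last Informative Note baseline keeps the smallest share of all.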