From: Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies
| Corpus | # Documents | # Words / # Unique Words |
| --- | --- | --- |
| WSJ-400 | 400 | 214 K / 19 K |
| WSJ-600 | 600 | 309 K / 23.5 K |
| WSJ-1300 | 1,300 | 680 K / 36 K |
| WSJx2 | 2,600 | 1.3 M / 36 K |
| WSJx3 | 3,900 | 2.6 M / 36 K |
| WSJs5 | 3,246 (±40) | 1.69 M (±42 K) / 36 K |