
Table 8 Corpora descriptive statistics

From: Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

Corpus      # Documents    # Words / # Unique Words
WSJ-400     400            214 K / 19 K
WSJ-600     600            309 K / 23.5 K
WSJ-1300    1,300          680 K / 36 K
WSJx2       2,600          1.3 M / 36 K
WSJx3       3,900          2.6 M / 36 K
WSJs5       3,246 (±40)    1.69 M (±42 K) / 36 K
  1. Synthetic corpora with various levels of redundancy. For WSJs5, we report averages and standard deviations based on 10 replications.
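
As a rough illustration of how the columns above could be computed, the sketch below counts documents, words, and unique words for a toy corpus, then averages the document count over 10 randomized replications. The helper names (`corpus_stats`, `redundant_copy`), the toy documents, and the redundancy model (each document repeated 1 to 5 times) are assumptions made for this example, not the procedure used to build the corpora in the table.

```python
import random
import statistics

def corpus_stats(docs):
    """Return (# documents, # words, # unique words) for a list of token lists."""
    tokens = [tok for doc in docs for tok in doc]
    return len(docs), len(tokens), len(set(tokens))

def redundant_copy(docs, max_repeats, rng):
    """Hypothetical redundancy model: repeat each document 1..max_repeats times."""
    out = []
    for doc in docs:
        out.extend([doc] * rng.randint(1, max_repeats))
    return out

# Toy stand-in for a base corpus (assumption: documents are already tokenized).
base = [
    ["the", "market", "fell"],
    ["the", "market", "rose"],
    ["stocks", "fell"],
]

# Exact duplication doubles the documents and words but adds no new vocabulary.
doubled = base * 2

# Ten randomized replications, summarized by mean and sample standard deviation.
rng = random.Random(0)
runs = [corpus_stats(redundant_copy(base, 5, rng)) for _ in range(10)]
doc_counts = [n_docs for n_docs, _, _ in runs]
mean_docs = statistics.mean(doc_counts)
std_docs = statistics.stdev(doc_counts)
```

Under this toy model, repeating documents increases the document and word counts while the number of unique words stays fixed, which matches the pattern in the table, where the unique-word count remains at 36 K from WSJ-1300 through the redundant variants.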