Table 8 Descriptive statistics of the corpora

From: Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

Corpus      # Documents     # Words / # Unique Words
WSJ-400     400             214 K / 19 K
WSJ-600     600             309 K / 23.5 K
WSJ-1300    1,300           680 K / 36 K
WSJx2       2,600           1.3 M / 36 K
WSJx3       3,900           2.6 M / 36 K
WSJs5       3,246 (±40)     1.69 M (±42 K) / 36 K

  1. Synthetic corpora with various levels of redundancy. For WSJs5, we report averages and standard deviations based on 10 replications.