Corpus name | Corpus type | Corpus size (# words / # distinct words) | # extracted collocations (TMI / PMI) | Average # documents per collocation (TMI / PMI) |
---|---|---|---|---|
WSJ-400 | Non-redundant | 214 K / 19 K | 551/565 | 20.2/19.9 |
WSJ-600 | Non-redundant | 309 K / 23.5 K | 943/1,000 | 15.5/15.2 |
WSJ-1300 | Non-redundant | 680 K / 36 K | 1,881/2,518 | 10.8/9.7 |
WSJs5 | Synthetic Redundant | 1.69 M (±42 K) / 36 K | 3,035 (±63)/17,015 (±950) | 7.4 (±0.11)/2.8 (±0.09) |