Table 4 Characteristics of the PMC Open Access dataset

From: Neural sentence embedding models for semantic similarity estimation in the biomedical domain

File size                                                            45 GB
Number of articles                                                   > 1,700,000
Total number of tokens                                               8,126,457,106
Number of unique words                                               31,974,798
Number of sentences                                                  277,809,416
Average line length before post-processing (number of characters)    162
Longest line length before post-processing (number of characters)    111,562
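
Statistics of this kind can be gathered in a single pass over a corpus file. The sketch below is illustrative only and is not the authors' procedure: the function name `corpus_stats` is hypothetical, and it assumes one sentence per line and plain whitespace tokenization, which may differ from the tokenizer and sentence splitter used to produce the figures in the table.

```python
from collections import Counter


def corpus_stats(lines):
    """Compute simple corpus statistics from an iterable of text lines.

    Assumptions (not taken from the paper): each line holds one sentence,
    and tokens are separated by whitespace.
    """
    vocab = Counter()   # token -> number of occurrences
    n_lines = 0
    total_chars = 0
    longest = 0

    for line in lines:
        line = line.rstrip("\n")        # do not count the newline itself
        n_lines += 1
        total_chars += len(line)
        longest = max(longest, len(line))
        vocab.update(line.split())      # whitespace tokenization

    return {
        "lines": n_lines,
        "tokens": sum(vocab.values()),
        "unique_words": len(vocab),
        "avg_line_length": total_chars / n_lines if n_lines else 0.0,
        "longest_line_length": longest,
    }


# Tiny in-memory example; a real corpus would be streamed from disk,
# e.g. corpus_stats(open(path, encoding="utf-8")).
stats = corpus_stats(["the cat sat", "the dog"])
```

Because the function consumes any iterable of lines, a file object can be passed directly and read lazily, so line lengths and counts are accumulated without loading the whole file. The vocabulary counter still grows with the number of distinct words, which for a corpus of this size would be the main memory cost.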