Table 4 Characteristics of the PMC Open Access dataset

From: Neural sentence embedding models for semantic similarity estimation in the biomedical domain

File size: 45 GB
Number of articles: > 1,700,000
Total number of tokens: 8,126,457,106
Number of unique words: 31,974,798
Number of sentences: 277,809,416
Average line length before post-processing: 162 characters
Longest line length before post-processing: 111,562 characters