Neural sentence embedding models for semantic similarity estimation in the biomedical domain

BMC Bioinformatics

Table 4 Characteristics of the PMC Open Access dataset

File size	45 GB
Number of articles	> 1,700,000
Total number of tokens	8,126,457,106
Number of unique words	31,974,798
Number of sentences	277,809,416
Average line length before post-processing (number of characters)	162
Longest line length before post-processing (number of characters)	111,562
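
Statistics of the kind listed in Table 4 can be gathered in a single streaming pass over a corpus file. The sketch below is illustrative only and is not the authors' procedure: it assumes whitespace tokenization and one sentence per non-empty line, neither of which is specified in the table. The function name `corpus_statistics` and the keys of the returned dictionary are hypothetical.

```python
from collections import Counter


def corpus_statistics(lines):
    """Compute simple corpus statistics from an iterable of text lines.

    Assumptions (not taken from the paper): tokens are whitespace-separated,
    and each non-empty line is counted as one line/sentence.
    """
    token_counts = Counter()
    n_lines = 0
    total_chars = 0
    longest = 0

    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        n_lines += 1
        total_chars += len(line)
        longest = max(longest, len(line))
        token_counts.update(line.split())

    return {
        "lines": n_lines,
        "tokens": sum(token_counts.values()),
        "unique_words": len(token_counts),
        "avg_line_length": total_chars / n_lines if n_lines else 0.0,
        "longest_line_length": longest,
    }


if __name__ == "__main__":
    # Tiny in-memory example; a real corpus would be streamed from a file,
    # e.g. `with open(path, encoding="utf-8") as f: corpus_statistics(f)`.
    sample = ["The cat sat .\n", "\n", "The dog ran away .\n"]
    for name, value in corpus_statistics(sample).items():
        print(f"{name}\t{value}")
```

Because the function accepts any iterable of lines, a file object can be passed directly, so lines are read one at a time rather than loading a corpus of tens of gigabytes at once. The vocabulary counter still grows with the number of distinct tokens, so memory use is governed by vocabulary size rather than file size.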

ISSN: 1471-2105