From: Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT
Summary of all parameters used (pre-training):

Parameter | Value
---|---
Architecture | ALBERT
Activation function | GELU
Attention heads | 12
No. of layers | 12
Size of hidden layer | 768
Size of embedding | 128
Size of vocabulary | 30k
Optimizer | LAMB
Training batch size | 1024 (base models); 256 (large models)
Evaluation batch size | 16
Maximum sentence length | 512
Maximum predictions per sentence | 20
Warm-up steps | 3125
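
To make the table concrete, the sketch below shows how the architectural parameters could map onto a Hugging Face `AlbertConfig`. This is an illustrative assumption, not the authors' released pre-training code; the optimizer, batch sizes, warm-up steps, and maximum predictions per sentence are training-loop settings rather than model-config fields, so they are grouped separately in a plain dictionary with hypothetical key names.

```python
from transformers import AlbertConfig

# Model-architecture settings from the table above (minimal sketch,
# assuming the Hugging Face transformers library).
config = AlbertConfig(
    vocab_size=30000,             # Size of vocabulary: 30k
    embedding_size=128,           # Size of embedding
    hidden_size=768,              # Size of hidden layer
    num_hidden_layers=12,         # No. of layers
    num_attention_heads=12,       # Attention heads
    hidden_act="gelu",            # Activation function: GELU
    max_position_embeddings=512,  # Maximum sentence length
)

# Training-loop settings (illustrative key names, not AlbertConfig fields).
training_settings = {
    "optimizer": "LAMB",
    "train_batch_size": 1024,        # base models (256 for large models)
    "eval_batch_size": 16,
    "max_predictions_per_seq": 20,   # maximum predictions per sentence
    "num_warmup_steps": 3125,
}

print(config)
print(training_settings)
```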