Table 3 Summary of parameters used in the pre-training of BioALBERT

From: Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT

Summary of all parameters used (pre-training):

Architecture: ALBERT
Activation function: GeLU
Attention heads: 12
No. of layers: 12
Size of hidden layer: 768
Size of embedding: 128
Size of vocabulary: 30k
Optimizer used: LAMB
Training batch size: 1024 (base models); 256 (large models)
Evaluation batch size: 16
Maximum sentence length: 512
Maximum predictions per sentence: 20
Warm-up steps: 3125
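For illustration only, the architecture values in Table 3 can be mapped onto the AlbertConfig class of the Hugging Face transformers library; this is a minimal sketch, not the authors' pre-training code (which would typically use the original TensorFlow ALBERT implementation), and the names in the training-settings dict below are illustrative placeholders rather than parameters taken from the paper.

```python
# Sketch: expressing the Table 3 hyperparameters with transformers' AlbertConfig.
# Values come from the table; keyword arguments are real AlbertConfig fields.
from transformers import AlbertConfig

config = AlbertConfig(
    vocab_size=30000,             # size of vocabulary (30k)
    embedding_size=128,           # size of embedding
    hidden_size=768,              # size of hidden layer
    num_hidden_layers=12,         # no. of layers
    num_attention_heads=12,       # attention heads
    hidden_act="gelu",            # GeLU activation function
    max_position_embeddings=512,  # maximum sentence length
)

# Training-loop settings from the table (dict keys are illustrative only):
pretraining_settings = {
    "optimizer": "LAMB",
    "train_batch_size": 1024,        # 256 for the large models
    "eval_batch_size": 16,
    "max_predictions_per_sentence": 20,
    "warmup_steps": 3125,
}
```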