From: Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT
Summary of all parameters used (pre-training):

Parameter | Value
---|---
Architecture | ALBERT
Activation function | GELU
Attention heads | 12
No. of layers | 12
Size of hidden layer | 768
Size of embedding | 128
Size of vocabulary | 30k
Optimizer | LAMB
Training batch size | 1024 (base models); 256 (large models)
Evaluation batch size | 16
Maximum sentence length | 512
Maximum predictions per sentence | 20
Warm-up steps | 3125
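
To make the table concrete, the sketch below shows how the architectural parameters could map onto a Hugging Face `AlbertConfig`. This is an illustrative assumption, not the authors' released pre-training code; the optimizer, batch sizes, warm-up steps, and maximum predictions per sentence are training-loop settings rather than model-config fields, so they are grouped separately in a plain dictionary with hypothetical key names.

```python
from transformers import AlbertConfig

# Model-architecture settings from the table above (minimal sketch,
# assuming the Hugging Face transformers library).
config = AlbertConfig(
    vocab_size=30000,             # Size of vocabulary: 30k
    embedding_size=128,           # Size of embedding
    hidden_size=768,              # Size of hidden layer
    num_hidden_layers=12,         # No. of layers
    num_attention_heads=12,       # Attention heads
    hidden_act="gelu",            # Activation function: GELU
    max_position_embeddings=512,  # Maximum sentence length
)

# Training-loop settings (illustrative key names, not AlbertConfig fields).
training_settings = {
    "optimizer": "LAMB",
    "train_batch_size": 1024,        # base models (256 for large models)
    "eval_batch_size": 16,
    "max_predictions_per_seq": 20,   # maximum predictions per sentence
    "num_warmup_steps": 3125,
}

print(config)
print(training_settings)
```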