BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Paper: arXiv:1810.04805
BERT-base (110M params) trained from scratch with the classic masked language modeling (MLM) objective from Devlin et al., 2018.
This model is part of a paired experiment comparing classic BERT MLM training against modern diffusion language model (DLM) training. See AntonXue/BERT-DLM for the counterpart.
Standard BERT MLM: 15% of tokens selected as targets, with 80/10/10 corruption (80% replaced with [MASK], 10% random token, 10% unchanged). Cross-entropy loss on target positions only.
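The corruption scheme above can be sketched as follows. This is a minimal illustration of the 80/10/10 rule, not the training code itself; the `[MASK]` id and vocabulary size assume `bert-base-uncased`, and `-100` is the conventional ignore index for cross-entropy so that loss is computed on target positions only.

```python
import random

MASK_TOKEN_ID = 103   # [MASK] in the bert-base-uncased vocab (assumed)
VOCAB_SIZE = 30522    # bert-base-uncased vocabulary size (assumed)

def mlm_corrupt(token_ids, mask_prob=0.15, rng=None):
    """Apply BERT's MLM corruption to a list of token ids.

    ~15% of positions are selected as prediction targets; of those,
    80% are replaced with [MASK], 10% with a random token, and 10%
    are left unchanged. Returns (corrupted_ids, labels), where labels
    is -100 (ignored by cross-entropy) at non-target positions.
    """
    rng = rng or random.Random()
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue               # not a target: no loss here
        labels[i] = tok            # loss is computed only at targets
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN_ID        # 80%: [MASK]
        elif r < 0.9:
            corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # else: 10% keep the original token
    return corrupted, labels
```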
| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU × 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
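The token count in the table follows directly from the batch and sequence settings:

```python
# Tokens seen = global batch size * sequence length * training steps
global_batch = 256
seq_len = 512
steps = 100_000

tokens_seen = global_batch * seq_len * steps
print(f"{tokens_seen / 1e9:.1f}B tokens")  # -> 13.1B tokens
```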
Training code: github.com/AntonXue/dBERT