BERT-MLM

BERT-base (110M params) trained from scratch with the classic masked language modeling (MLM) objective from Devlin et al., 2018.

This model is part of a paired experiment comparing classic BERT MLM training against modern diffusion language model (DLM) training. See AntonXue/BERT-DLM for the counterpart.

Training Objective

Standard BERT MLM: 15% of tokens are selected as prediction targets, with 80/10/10 corruption (80% replaced with [MASK], 10% replaced with a random token, 10% left unchanged). Cross-entropy loss is computed on target positions only.
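The corruption scheme above can be sketched as follows (illustrative only, not the training code; `mlm_corrupt` is a hypothetical helper, 103 is the [MASK] id in the standard BERT vocab, and -100 is the usual cross-entropy ignore index):

```python
import random

MASK_ID = 103  # [MASK] token id in the standard BERT vocab (assumption)

def mlm_corrupt(tokens, vocab_size, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption: select ~15% of positions as targets,
    then replace 80% of them with [MASK], 10% with a random token, and
    leave 10% unchanged. Returns (corrupted, labels) where labels is
    -100 at non-target positions so they are ignored by the loss."""
    rng = rng or random.Random(0)
    corrupted = list(tokens)
    labels = [-100] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # cross-entropy is computed only here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_ID          # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% left unchanged
    return corrupted, labels
```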

Dataset

  • BookCorpusOpen — ~17K books
  • English Wikipedia (20231101.en) — ~6.4M articles
  • Split: 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
  • Train sequences: 10,784,085
  • Total train tokens: 5.52B
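The tokenize-and-pack step can be sketched as follows (a hypothetical helper; the assumption that a trailing remainder shorter than 512 tokens is dropped is mine, not stated above):

```python
def pack_sequences(docs, seq_len=512):
    """Concatenate tokenized documents into one token stream and slice it
    into fixed-length sequences with no padding; any trailing remainder
    that does not fill a full window is dropped (assumption)."""
    stream = [tok for doc in docs for tok in doc]
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```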

Training Configuration

| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU × 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2× NVIDIA H100 NVL |
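The constant-with-warmup schedule in the table corresponds to a simple rule (an illustrative sketch, not the training code; `lr_at` is a hypothetical name, and linear warmup is an assumption):

```python
def lr_at(step, base_lr=1e-4, warmup_steps=500):
    """Constant LR with linear warmup: ramp from 0 to base_lr over the
    first warmup_steps, then hold base_lr flat for the rest of training."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```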

Usage
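A minimal inference sketch with Hugging Face `transformers` (assuming the checkpoint is published on the Hub as AntonXue/BERT-MLM with a bundled standard BERT tokenizer, so `[MASK]` is the mask token):

```python
from transformers import pipeline

# Load the checkpoint from the Hub (assumes the model id below and that
# the tokenizer ships with the checkpoint).
unmasker = pipeline("fill-mask", model="AntonXue/BERT-MLM")

preds = unmasker("The capital of France is [MASK].")
for p in preds:
    print(p["token_str"], round(p["score"], 3))
```

The pipeline returns the top candidates for the masked position with their probabilities (top 5 by default).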

Code

Training code: github.com/AntonXue/dBERT
