BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Paper: arXiv:1810.04805
BERT-base (110M params) trained from scratch with the classic masked language modeling (MLM) objective from Devlin et al., 2018.
This model is part of a paired experiment comparing classic BERT MLM training against modern diffusion language model (DLM) training. See AntonXue/BERT-DLM for the counterpart.
Standard BERT MLM: 15% of tokens selected as targets, with 80/10/10 corruption (80% replaced with [MASK], 10% random token, 10% unchanged). Cross-entropy loss on target positions only.
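The corruption scheme above can be sketched as follows. This is a minimal illustration of the 80/10/10 rule, not the training code itself; the `[MASK]` id and vocabulary size assume `bert-base-uncased`, and `-100` is the conventional ignore index for cross-entropy so that loss is computed on target positions only.

```python
import random

MASK_TOKEN_ID = 103   # [MASK] in the bert-base-uncased vocab (assumed)
VOCAB_SIZE = 30522    # bert-base-uncased vocabulary size (assumed)

def mlm_corrupt(token_ids, mask_prob=0.15, rng=None):
    """Apply BERT's MLM corruption to a list of token ids.

    ~15% of positions are selected as prediction targets; of those,
    80% are replaced with [MASK], 10% with a random token, and 10%
    are left unchanged. Returns (corrupted_ids, labels), where labels
    is -100 (ignored by cross-entropy) at non-target positions.
    """
    rng = rng or random.Random()
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue               # not a target: no loss here
        labels[i] = tok            # loss is computed only at targets
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN_ID        # 80%: [MASK]
        elif r < 0.9:
            corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # else: 10% keep the original token
    return corrupted, labels
```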
| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU × 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
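The token count in the table follows directly from the batch and sequence settings:

```python
# Tokens seen = global batch size * sequence length * training steps
global_batch = 256
seq_len = 512
steps = 100_000

tokens_seen = global_batch * seq_len * steps
print(f"{tokens_seen / 1e9:.1f}B tokens")  # -> 13.1B tokens
```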
Training code: github.com/AntonXue/dBERT