# GPT-2 (124M) - Research Ablation Baseline

## Model Summary
This is a 124M parameter Causal Language Model (GPT-2 Small architecture) trained entirely from scratch using PyTorch.
It was created as the baseline for a research ablation study investigating training dynamics, and reaches a validation loss of 4.485.
## Model Details
- Architecture: Custom GPT-2 Small (Decoder-only Transformer)
- Parameters: 124M
- Context Window: 256 tokens
- Embedding Dimension: 768
- Attention Heads: 12
- Layers: 12
- Training Steps: ~28,500
- Validation Loss: 4.485
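For reference, the hyperparameters above can be collected into a small configuration object. This is a minimal sketch only: the dataclass and field names are illustrative, and the vocabulary size is an assumption (the standard GPT-2 BPE vocabulary), not something stated in this card.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values taken from the Model Details section above
    block_size: int = 256    # context window (tokens)
    n_layer: int = 12        # transformer blocks
    n_head: int = 12         # attention heads
    n_embd: int = 768        # embedding dimension
    vocab_size: int = 50257  # assumption: standard GPT-2 BPE vocabulary
```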
## How to Use
⚠️ **Important:** Because this model was trained with a custom PyTorch class rather than the standard Hugging Face `GPT2LMHeadModel`, you must define the model architecture in your code before loading the weights.
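The exact class is the one used during training; the sketch below is only an illustration of what a GPT-2 Small definition and weight-loading step might look like in plain PyTorch. All class names, parameter names, and the checkpoint filename here are assumptions and will generally not match the real checkpoint's state-dict keys unless adapted.

```python
# Minimal, nanoGPT-style sketch of a GPT-2 Small decoder-only transformer.
# Assumption: the downloaded checkpoint is a plain PyTorch state_dict whose
# keys match this class's parameter names (in practice, adapt one to the other).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # causal (masked) self-attention over the sequence
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class Block(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-norm residual attention
        x = x + self.mlp(self.ln_2(x))   # pre-norm residual MLP
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size=50257, block_size=256, n_layer=12, n_head=12, n_embd=768):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)   # token embeddings
        self.wpe = nn.Embedding(block_size, n_embd)   # learned positional embeddings
        self.h = nn.ModuleList(Block(n_embd, n_head) for _ in range(n_layer))
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)
        for block in self.h:
            x = block(x)
        return self.lm_head(self.ln_f(x))             # logits over the vocabulary

# Instantiate the architecture first, then load the downloaded weights into it.
model = GPT()
state_dict = torch.load("pytorch_model.bin", map_location="cpu")  # filename is an assumption
model.load_state_dict(state_dict)  # requires keys to match this class's parameter names
model.eval()
```

If the checkpoint's keys differ from the class you define, `load_state_dict` will raise an error listing the missing and unexpected keys, which is usually enough to map one naming scheme onto the other.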