flan-t5-base-mixed-1-1-catastrophic
Model trained as part of: "Mitigating Catastrophic Forgetting in Mathematical Reasoning Finetuning through Mixed Training"
This model is part of a study of catastrophic forgetting when finetuning language models for specialized tasks. We show that math-only training causes severe NLI degradation (81% → 16.5%), whereas mixed training eliminates the forgetting while maintaining equivalent mathematical performance.
Quick Links
- 📄 Paper: arXiv (to be updated after submission)
- 💻 Code: [GitHub Repository](https://github.com/johngrahamreynolds/mathematical_catastrophe_mitigation)
- 🤗 Model Collection: All experiment checkpoints
Model Description
This is the final checkpoint after 3 epochs of training, corresponding to the mixed-1-1-final configuration from our systematic study of catastrophic forgetting mitigation strategies.
Training Configuration
- Base Model: google/flan-t5-base (250M parameters)
- Training Type: Mixed (1:1 Math:NLI)
- Math Dataset: DeepMind Mathematics dataset (algebra__linear_1d subset), 392,702 training examples
- NLI Dataset: MultiNLI (matched + mismatched splits), 392,702 training examples
- Training Details (sketched below):
  - Learning rate: 3e-4 with cosine decay
  - Warmup: 6% of total steps
  - Epochs: 3
  - Effective batch size: 256 examples
  - Precision: bfloat16
  - Optimizer: FusedAdam
  - Hardware: single NVIDIA A100 (40 GB)
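For reference, the hyperparameters above map roughly onto Hugging Face `Seq2SeqTrainingArguments` as follows. This is a hedged sketch, not the project's training script: the output path, per-device batch size, accumulation steps, and the `adamw_torch_fused` optimizer (standing in for FusedAdam) are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative only; per-device batch size and accumulation steps are assumptions
# chosen so that 32 * 8 = 256 matches the effective batch size listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-base-mixed-1-1",   # placeholder path
    learning_rate=3e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,                      # 6% of total steps
    num_train_epochs=3,
    per_device_train_batch_size=32,         # assumption
    gradient_accumulation_steps=8,          # assumption; 32 * 8 = 256 effective
    bf16=True,
    optim="adamw_torch_fused",              # closest built-in stand-in for FusedAdam
    predict_with_generate=True,
)
```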
This model was trained with a 1:1 mixing ratio, meaning 50.0% math examples and 50.0% NLI examples in each batch.
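One straightforward way to realize such a mixture is `datasets.interleave_datasets`. The sketch below is illustrative only: the dataset identifiers, Flan-style label verbalization, and preprocessing are assumptions rather than the exact pipeline used for this checkpoint, and the same call covers other ratios (e.g., 15:1) by changing the sampling probabilities.

```python
from datasets import load_dataset, interleave_datasets

# Assumed Flan-style verbalization: entailment / neutral / contradiction
NLI_LABELS = ["yes", "it is not possible to tell", "no"]

# Hypothetical preprocessing: both sources are mapped to a shared
# {"input_text", "target_text"} schema before mixing.
def prep_math(ex):
    return {"input_text": ex["question"], "target_text": ex["answer"]}

def prep_nli(ex):
    prompt = f"mnli premise: {ex['premise']} hypothesis: {ex['hypothesis']}"
    return {"input_text": prompt, "target_text": NLI_LABELS[ex["label"]]}

# Dataset IDs are illustrative; recent `datasets` versions may require
# trust_remote_code or a different source for the DeepMind Mathematics data.
math_ds = load_dataset("math_dataset", "algebra__linear_1d", split="train")
nli_ds = load_dataset("multi_nli", split="train")

math_ds = math_ds.map(prep_math, remove_columns=math_ds.column_names)
nli_ds = nli_ds.map(prep_nli, remove_columns=nli_ds.column_names)

# 1:1 mixture: each training example is drawn from either source with equal
# probability. For a 15:1 math:NLI ratio, use probabilities=[15/16, 1/16].
mixed = interleave_datasets(
    [math_ds, nli_ds],
    probabilities=[0.5, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",
)
```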
Performance
Evaluation Protocol: Final evaluation on complete validation sets
- Math: 10,000 examples (DeepMind Mathematics linear algebra 1D)
- NLI: 9,815 examples (MultiNLI matched split)
| Task | Accuracy | Baseline | Δ from Baseline |
|---|---|---|---|
| Mathematical Reasoning | 12.0% | 3.1% | +8.9pp |
| Natural Language Inference | 86.2% | 81.0% | +5.2pp |
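The accuracies above are presumably exact match of the generated string against the reference answer or label. A minimal sketch of how such an evaluation could be computed is shown below; the `pairs` list of (prompt, reference) tuples and the batching details are assumptions, not the project's evaluation script.

```python
import torch

def exact_match_accuracy(model, tokenizer, pairs, batch_size=32, max_new_tokens=8):
    """Sketch: greedy-decode each prompt and compare the string to its reference.

    `pairs` is a hypothetical list of (prompt, reference_answer) tuples.
    """
    correct = 0
    for i in range(0, len(pairs), batch_size):
        prompts, answers = zip(*pairs[i : i + batch_size])
        inputs = tokenizer(list(prompts), return_tensors="pt", padding=True).to(model.device)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        correct += sum(p.strip() == a.strip() for p, a in zip(preds, answers))
    return correct / len(pairs)
```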
Key Findings from Our Study
- Catastrophic Forgetting is Severe: Math-only training drops NLI accuracy from 81% to 16.5% (−64.5pp)
- Mixed Training Eliminates Forgetting: Balanced 1:1 ratio maintains 86.2% NLI while achieving 12.0% math
- No Performance Trade-off: Mixed training matches math-only performance (12.0% vs 12.0%)
- Minimal Exposure Suffices: Even 6.25% NLI exposure (15:1 ratio) prevents catastrophic collapse
Usage
Basic Inference
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("MarioBarbeque/flan-t5-base-mixed-1-1-catastrophic")
tokenizer = T5Tokenizer.from_pretrained("MarioBarbeque/flan-t5-base-mixed-1-1-catastrophic")

# Mathematical reasoning example
math_input = "Solve 24 = 1601*c - 1605*c for c."
inputs = tokenizer(math_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "-6"

# NLI example
nli_input = "mnli premise: The cat sat on the mat. hypothesis: An animal was on the mat."
inputs = tokenizer(nli_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "yes" (entailment)
```
Batch Processing
```python
import torch

# Batch of linear algebra problems
math_problems = [
    "Solve 24 = 1601*c - 1605*c for c.",
    "Solve 657 = -220*t + 1086*t + 22307 for t.",
    "Solve -11*y - 263*y + 3162 = -88*y for y.",
]

inputs = tokenizer(math_problems, return_tensors="pt", padding=True)
with torch.no_grad():  # no gradients needed for inference
    outputs = model.generate(**inputs, max_new_tokens=8)
results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(results)
# Reference solutions: c = -6, t = -25, y = 17
```
Training Code
The complete training code, evaluation scripts, and experiment configurations are available in our GitHub repository.
Related Models
- Larger Scale: CyberSolve-LinAlg-1.2 - Flan-T5-Large (780M) achieving 90.8% on math (roughly 7.6× the math accuracy of this 250M model)
- Other Experiments: See all checkpoints from this study at MarioBarbeque's models
Citation
If you use this model in your research, please cite:
```bibtex
@article{reynolds2024catastrophic,
  title={Mitigating Catastrophic Forgetting in Mathematical Reasoning Finetuning through Mixed Training},
  author={Reynolds, John Graham},
  journal={arXiv preprint},
  year={2024},
  url={https://github.com/johngrahamreynolds/mathematical_catastrophe_mitigation}
}
```
For the CyberSolve-LinAlg model (Flan-T5-Large baseline):
```bibtex
@misc{cybersolve2024,
  author={Reynolds, John Graham},
  title={CyberSolve-LinAlg: Flan-T5-Large Finetuned for Linear Algebra Problem Solving},
  year={2024},
  howpublished={\url{https://huggingface.co/MarioBarbeque/CyberSolve-LinAlg-1.2}}
}
```
License
This model is released under the Apache 2.0 license, following the base model (google/flan-t5-base).
Model Card Authors
John Graham Reynolds (@MarioBarbeque)
Contact
- Email: [email protected]
- GitHub: @johngrahamreynolds
Acknowledgments
This research would not have been possible without the wonderful instruction of Greg Durrett. The author would also like to thank John Jumper for motivating this research during his visit to Vanderbilt University.