Qwen3-1.7B Magistral Math (GGUF)

License: Apache-2.0 · Model: Qwen3-1.7B · Format: GGUF · Domain: Math Reasoning · Quantizations: F16, Q8_0, Q4_K_M


TL;DR

This is a math-focused fine-tune of unsloth/Qwen3-1.7B-Base, exported to GGUF (F16 / Q8_0 / Q4_K_M) with Unsloth.

  • Goal: a small 1.7B model specialized for grade-school and early high-school math reasoning.

  • Data: HAD653/GSM8K-OpenMath-MathReason-13k – 13.9k math word problems with structured chain-of-thought.

  • Format: answers always follow the same pattern:

    Problem:
    ...
    
    Reasoning:
    ...
    
    Answer:
    <final numeric answer>
    
  • Best use: GSM8K-style problems, OpenMath-style word problems, step-by-step reasoning with a single numeric final answer.

Model Description

  • Base model: unsloth/Qwen3-1.7B-Base (Apache-2.0)
  • Architecture: Qwen3 dense causal LM, ~1.7B params, 28 layers, grouped-query attention (GQA), 32k context.
  • Type: decoder-only LLM, text generation.
  • This repo: inference-only GGUF weights for llama.cpp / LM Studio / Ollama / text-generation-webui.

Available files

From the Files tab:

  • Qwen3-1.7B-Magistral-Math-F16.gguf – highest quality, requires the most VRAM.
  • Qwen3-1.7B-Magistral-Math-Q8_0.gguf – 8-bit quantization.
  • Qwen3-1.7B-Magistral-Math-Q4_K_M.gguf – 4-bit K-quant, best for smaller GPUs.

These files contain the fine-tuned math weights, exported via model.save_pretrained_gguf after full BF16 training.
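For reference, a minimal sketch of how such an export typically looks with Unsloth (the checkpoint path and output directory below are illustrative, not the exact ones used for this repo):

from unsloth import FastLanguageModel

# Load the fine-tuned BF16 checkpoint (path is hypothetical).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/qwen3-1.7b-magistral-math",
    max_seq_length=2048,
    load_in_4bit=False,
)

# Export one GGUF file per quantization level.
for quant in ("f16", "q8_0", "q4_k_m"):
    model.save_pretrained_gguf("gguf-export", tokenizer, quantization_method=quant)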


Training Data

This model is fine-tuned on:

  • Dataset: HAD653/GSM8K-OpenMath-MathReason-13k

  • Size: 13,857 examples.

  • Fields:

    • question: natural language math word problem.

    • cot: structured solution with three blocks:

      • Problem:
      • Reasoning:
      • Answer:
    • final_answer: canonical numeric answer (string).

The dataset focuses on easy–medium difficulty: basic arithmetic, fractions, percentages, rate problems, simple algebra, and simple combinatorics – the kind of tasks a 1–3B model can genuinely master.
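To inspect the data yourself, here is a minimal sketch using the datasets library (assuming the dataset ships a single train split):

from datasets import load_dataset

ds = load_dataset("HAD653/GSM8K-OpenMath-MathReason-13k", split="train")

example = ds[0]
print(example["question"])      # the word problem
print(example["cot"])           # Problem / Reasoning / Answer blocks
print(example["final_answer"])  # canonical numeric answer as a string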


Training Setup (Summary)

Fine-tuning was done with Unsloth + TRL on a single RTX 4090, using full BF16 fine-tuning (no LoRA).

Main hyperparameters (see the configuration sketch after this list):

  • Base: unsloth/Qwen3-1.7B-Base

  • Sequence length: 2048

  • Batching: per_device_train_batch_size = 2, gradient_accumulation_steps = 8

  • Effective batch size: ≈ 16 sequences (2 × 8)

  • Epochs: 2

  • Optimizer / schedule:

    • learning_rate = 7e-5
    • linear scheduler, warmup_ratio = 0.05
    • weight_decay = 0.01
  • Precision & memory:

    • dtype = bfloat16
    • gradient_checkpointing = True
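
The hyperparameters above map roughly onto a TRL SFT configuration like the following. This is a sketch only; exact argument names vary slightly across TRL and Unsloth versions, and the output path is illustrative:

from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="outputs/qwen3-1.7b-magistral-math",   # illustrative path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    learning_rate=7e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,
    gradient_checkpointing=True,
)

# trainer = SFTTrainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()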

Supervision format

The training text for each sample is:

### Instruction:
{question}

### Response:
{cot}</s>

where </s> is the tokenizer's EOS token. Appending the EOS token to each sample teaches the model when to stop, which greatly reduces “Answer: 36 / Answer: 36 / …” loops during inference.
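A minimal sketch of the per-sample formatting, assuming the question / cot field names from the dataset and the tokenizer's EOS token:

def format_example(example, eos_token):
    # Build the exact training text: instruction block, response block, then EOS.
    return (
        "### Instruction:\n"
        f"{example['question']}\n\n"
        "### Response:\n"
        f"{example['cot']}{eos_token}"
    )

# texts = [format_example(ex, tokenizer.eos_token) for ex in ds]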


Prompting & Templates

Recommended system prompt (optional but useful)

You are a math reasoning assistant.

For every question, answer in exactly this format:

Problem:
<restate the problem in your own words>

Reasoning:
<step-by-step reasoning showing all intermediate steps>

Answer:
<final numeric answer only, on its own line>

Do not add any extra commentary before or after the answer.
Do not repeat the answer multiple times.
Stop after writing the final answer.

Inference template (matches training)

Single-turn format:

### Instruction:
{question}

### Response:

The model will then generate:

Problem:
...

Reasoning:
...

Answer:
<number>
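
If you are scripting evaluations, a small helper for building the prompt and pulling out the final number is often handy. This is a sketch; the regex assumes a single numeric final answer as described above:

import re

def build_prompt(question: str) -> str:
    return f"### Instruction:\n{question}\n\n### Response:\n"

def extract_answer(completion: str) -> str | None:
    # Grab the first number after the last "Answer:" block.
    tail = completion.rsplit("Answer:", 1)[-1]
    m = re.search(r"-?\d[\d,]*(?:\.\d+)?", tail)
    return m.group(0).replace(",", "") if m else None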

Stop strings

In addition to the EOS token, you can add stop strings in your UI:

  • ### Instruction:
  • ### Response:

Many frontends (LM Studio, text-generation-webui, KoboldCpp, etc.) let you configure these so the model stops cleanly when it tries to start the next turn.


Quantization & Hardware Tips

The three variants in this repo behave roughly as follows (sizes are approximate):

  • Q4_K_M (~1.1 GB) – best for:

    • 4–6 GB GPUs or pure CPU inference.
    • Fast experimentation / local tools / “math assistant on a laptop”.
  • Q8_0 (~1.8 GB) – good compromise:

    • 8–12 GB GPUs.
    • Often slightly more stable than Q4 on harder problems.
  • F16 (~3.5 GB) – highest fidelity:

    • 12+ GB GPUs (e.g. RTX 4090, RTX 4080, RTX 4070 12 GB, A4000).
    • Recommended if VRAM allows and you care about maximum accuracy.

As a rule of thumb, choose a file that is 1–2 GB smaller than your available VRAM.


Usage Examples

llama.cpp

Once you have built llama.cpp, you can run the model like this (replace the model path with your own):

./llama-cli \
  -m Qwen3-1.7B-Magistral-Math-Q4_K_M.gguf \
  -p "### Instruction:
Albert buys 2 large pizzas and 2 small pizzas. A large pizza has 16 slices and a small pizza has 8 slices. If he eats it all, how many pieces does he eat that day?

### Response:
" \
  -n 256 \
  --temp 0.1 \
  --top-p 0.9 \
  --repeat-penalty 1.05

Suggested decoding for math:

  • temperature: 0.0–0.2
  • top_p: 0.9
  • repeat_penalty: 1.05–1.1
  • top_k: 20–40 (optional tweak)
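
If you prefer calling the model from Python, here is a minimal sketch with the llama-cpp-python bindings, using the same template, stop strings, and decoding settings (the file path and question are illustrative):

from llama_cpp import Llama

llm = Llama(model_path="Qwen3-1.7B-Magistral-Math-Q4_K_M.gguf", n_ctx=2048)

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
prompt = f"### Instruction:\n{question}\n\n### Response:\n"

out = llm(
    prompt,
    max_tokens=256,
    temperature=0.1,
    top_p=0.9,
    repeat_penalty=1.05,
    stop=["### Instruction:", "### Response:"],
)
print(out["choices"][0]["text"])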

LM Studio / other UIs

Set the prompt template to:

### Instruction:
{{prompt}}

### Response:

Add stop strings:

  • ### Instruction:
  • ### Response:

and keep temperature low for math benchmarks.


Intended Uses & Limitations

Intended uses

  • Solving GSM8K-style and OpenMath-style word problems.
  • Training / evaluating small-scale math reasoning pipelines.
  • Serving as a local math tutor for grade-school / early high-school algebra & arithmetic.

Limitations

  • Not a general chat/instruction model; it is biased toward math.
  • CoT is learned from synthetic teacher traces, not human-written solutions.
  • Not suitable for high-stakes educational or decision-making use without human oversight.
  • Performance on very hard competition math (Olympiad-level, deep proofs) will be limited – the training data explicitly focuses on easy–medium difficulty.

Users are responsible for ensuring there is no data leakage if they evaluate on GSM8K/OpenMath-derived benchmarks.


Acknowledgements


Citation

If you use this model in your work, please cite:

@misc{had653_qwen3_magistral_math_gguf_2025,
  author       = {HAD653},
  title        = {Qwen3-1.7B Magistral Math (GGUF): A 1.7B Math Reasoning Model with Magistral Chain-of-Thought},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/HAD653/qwen3-1.7b-magistral-math-gguf}},
  note         = {Fine-tuned on GSM8K + OpenMath MathReason 13k, exported to GGUF (F16 / Q8\_0 / Q4\_K\_M).}
}