Qwen3-1.7B Magistral Math (GGUF)

License: Apache-2.0 · Model: Qwen3-1.7B · Format: GGUF · Domain: Math Reasoning · Quantizations: F16, Q8_0, Q4_K_M


TL;DR

This is a math-focused fine-tune of unsloth/Qwen3-1.7B-Base, exported to GGUF (F16 / Q8_0 / Q4_K_M) with Unsloth.

  • Goal: a small 1.7B model specialized for grade-school and early high-school math reasoning.

  • Data: HAD653/GSM8K-OpenMath-MathReason-13k – 13.9k math word problems with structured chain-of-thought.

  • Format: answers always follow the same pattern:

    Problem:
    ...
    
    Reasoning:
    ...
    
    Answer:
    <final numeric answer>
    
  • Best use: GSM8K-style problems, OpenMath-style word problems, step-by-step reasoning with a single numeric final answer.

Model Description

  • Base model: unsloth/Qwen3-1.7B-Base (Apache-2.0)
  • Architecture: Qwen3 dense causal LM, ~1.7B params, 28 layers, grouped-query attention (GQA), 32k context.
  • Type: decoder-only LLM, text generation.
  • This repo: inference-only GGUF weights for llama.cpp / LM Studio / Ollama / text-generation-webui.

Available files

From the Files tab:

  • Qwen3-1.7B-Magistral-Math-F16.gguf – highest quality, requires the most VRAM.
  • Qwen3-1.7B-Magistral-Math-Q8_0.gguf – 8-bit quantization.
  • Qwen3-1.7B-Magistral-Math-Q4_K_M.gguf – 4-bit K-quant, best for smaller GPUs.

These files contain the fine-tuned math weights, exported via model.save_pretrained_gguf after full BF16 training.
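For reference, a minimal sketch of how such an export typically looks with Unsloth (the checkpoint path and output directory below are illustrative, not the exact ones used for this repo):

from unsloth import FastLanguageModel

# Load the fine-tuned BF16 checkpoint (path is hypothetical).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/qwen3-1.7b-magistral-math",
    max_seq_length=2048,
    load_in_4bit=False,
)

# Export one GGUF file per quantization level.
for quant in ("f16", "q8_0", "q4_k_m"):
    model.save_pretrained_gguf("gguf-export", tokenizer, quantization_method=quant)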


Training Data

This model is fine-tuned on:

  • Dataset: HAD653/GSM8K-OpenMath-MathReason-13k

  • Size: 13,857 examples.

  • Fields:

    • question: natural language math word problem.

    • cot: structured solution with three blocks:

      • Problem:
      • Reasoning:
      • Answer:
    • final_answer: canonical numeric answer (string).

The dataset focuses on easy–medium difficulty: basic arithmetic, fractions, percentages, rate problems, simple algebra, and simple combinatorics – the kind of tasks a 1–3B model can genuinely master.
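To inspect the data yourself, here is a minimal sketch using the datasets library (assuming the dataset ships a single train split):

from datasets import load_dataset

ds = load_dataset("HAD653/GSM8K-OpenMath-MathReason-13k", split="train")

example = ds[0]
print(example["question"])      # the word problem
print(example["cot"])           # Problem / Reasoning / Answer blocks
print(example["final_answer"])  # canonical numeric answer as a string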


Training Setup (Summary)

Fine-tuning was done with Unsloth + TRL on a single RTX 4090, using full BF16 fine-tuning (no LoRA).

Main hyperparameters (see the configuration sketch after this list):

  • Base: unsloth/Qwen3-1.7B-Base

  • Sequence length: 2048

  • Batching: per_device_train_batch_size = 2, gradient_accumulation_steps = 8

  • Effective batch size: ≈ 16 sequences (2 × 8)

  • Epochs: 2

  • Optimizer / schedule:

    • learning_rate = 7e-5
    • linear scheduler, warmup_ratio = 0.05
    • weight_decay = 0.01
  • Precision & memory:

    • dtype = bfloat16
    • gradient_checkpointing = True
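
The hyperparameters above map roughly onto a TRL SFT configuration like the following. This is a sketch only; exact argument names vary slightly across TRL and Unsloth versions, and the output path is illustrative:

from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="outputs/qwen3-1.7b-magistral-math",   # illustrative path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    learning_rate=7e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,
    gradient_checkpointing=True,
)

# trainer = SFTTrainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()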

Supervision format

The training text for each sample is:

### Instruction:
{question}

### Response:
{cot}</s>

where </s> is the tokenizer's EOS token. Appending the EOS token to each sample teaches the model when to stop, which greatly reduces “Answer: 36 / Answer: 36 / …” loops during inference.
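A minimal sketch of the per-sample formatting, assuming the question / cot field names from the dataset and the tokenizer's EOS token:

def format_example(example, eos_token):
    # Build the exact training text: instruction block, response block, then EOS.
    return (
        "### Instruction:\n"
        f"{example['question']}\n\n"
        "### Response:\n"
        f"{example['cot']}{eos_token}"
    )

# texts = [format_example(ex, tokenizer.eos_token) for ex in ds]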


Prompting & Templates

Recommended system prompt (optional but useful)

You are a math reasoning assistant.

For every question, answer in exactly this format:

Problem:
<restate the problem in your own words>

Reasoning:
<step-by-step reasoning showing all intermediate steps>

Answer:
<final numeric answer only, on its own line>

Do not add any extra commentary before or after the answer.
Do not repeat the answer multiple times.
Stop after writing the final answer.

Inference template (matches training)

Single-turn format:

### Instruction:
{question}

### Response:

The model will then generate:

Problem:
...

Reasoning:
...

Answer:
<number>
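
If you are scripting evaluations, a small helper for building the prompt and pulling out the final number is often handy. This is a sketch; the regex assumes a single numeric final answer as described above:

import re

def build_prompt(question: str) -> str:
    return f"### Instruction:\n{question}\n\n### Response:\n"

def extract_answer(completion: str) -> str | None:
    # Grab the first number after the last "Answer:" block.
    tail = completion.rsplit("Answer:", 1)[-1]
    m = re.search(r"-?\d[\d,]*(?:\.\d+)?", tail)
    return m.group(0).replace(",", "") if m else None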

Stop strings

In addition to the EOS token, you can add stop strings in your UI:

  • ### Instruction:
  • ### Response:

Many frontends (LM Studio, text-generation-webui, KoboldCpp, etc.) let you configure these so the model stops cleanly when it tries to start the next turn.


Quantization & Hardware Tips

The three variants in this repo behave roughly as follows (sizes are approximate):

  • Q4_K_M (~1.1 GB) – best for:

    • 4–6 GB GPUs or pure CPU inference.
    • Fast experimentation / local tools / “math assistant on a laptop”.
  • Q8_0 (~1.8 GB) – good compromise:

    • 8–12 GB GPUs.
    • Often slightly more stable than Q4 on harder problems.
  • F16 (~3.5 GB) – highest fidelity:

    • 12+ GB GPUs (e.g. RTX 4090, RTX 4080, RTX 4070 12 GB, A4000).
    • Recommended if VRAM allows and you care about maximum accuracy.

As a rule of thumb, choose a file that is 1–2 GB smaller than your available VRAM.


Usage Examples

llama.cpp

Once you have built llama.cpp, you can run the model like this (replace the model path with your own):

./llama-cli \
  -m Qwen3-1.7B-Magistral-Math-Q4_K_M.gguf \
  -p "### Instruction:
Albert buys 2 large pizzas and 2 small pizzas. A large pizza has 16 slices and a small pizza has 8 slices. If he eats it all, how many pieces does he eat that day?

### Response:
" \
  -n 256 \
  --temp 0.1 \
  --top-p 0.9 \
  --repeat-penalty 1.05

Suggested decoding for math:

  • temperature: 0.0–0.2
  • top_p: 0.9
  • repeat_penalty: 1.05–1.1
  • top_k: 20–40 (optional tweak)
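
If you prefer calling the model from Python, here is a minimal sketch with the llama-cpp-python bindings, using the same template, stop strings, and decoding settings (the file path and question are illustrative):

from llama_cpp import Llama

llm = Llama(model_path="Qwen3-1.7B-Magistral-Math-Q4_K_M.gguf", n_ctx=2048)

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
prompt = f"### Instruction:\n{question}\n\n### Response:\n"

out = llm(
    prompt,
    max_tokens=256,
    temperature=0.1,
    top_p=0.9,
    repeat_penalty=1.05,
    stop=["### Instruction:", "### Response:"],
)
print(out["choices"][0]["text"])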

LM Studio / other UIs

Set the prompt template to:

### Instruction:
{{prompt}}

### Response:

Add stop strings:

  • ### Instruction:
  • ### Response:

and keep temperature low for math benchmarks.


Intended Uses & Limitations

Intended uses

  • Solving GSM8K-style and OpenMath-style word problems.
  • Training / evaluating small-scale math reasoning pipelines.
  • Serving as a local math tutor for grade-school / early high-school algebra & arithmetic.

Limitations

  • Not a general chat/instruction model; it is biased toward math.
  • CoT is learned from synthetic teacher traces, not human-written solutions.
  • Not suitable for high-stakes educational or decision-making use without human oversight.
  • Performance on very hard competition math (Olympiad-level, deep proofs) will be limited – the training data explicitly focuses on easy–medium difficulty.

Users are responsible for ensuring there is no data leakage if they evaluate on GSM8K/OpenMath-derived benchmarks.


Acknowledgements


Citation

If you use this model in your work, please cite:

@misc{had653_qwen3_magistral_math_gguf_2025,
  author       = {HAD653},
  title        = {Qwen3-1.7B Magistral Math (GGUF): A 1.7B Math Reasoning Model with Magistral Chain-of-Thought},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/HAD653/qwen3-1.7b-magistral-math-gguf}},
  note         = {Fine-tuned on GSM8K + OpenMath MathReason 13k, exported to GGUF (F16 / Q8\_0 / Q4\_K\_M).}
}