2507 Thinking model release

#4 opened by anjeysapkovski

Dear team!

The current AutoRound q2ks version of Qwen3-30B-A3B-Instruct-2507 is amazingly fast and stable on consumer GPUs like the RTX 3060 12 GB and RTX 5060 16 GB, with around 110 t/s output and reasonable results on multilingual tasks.

It fits 12-16 GB VRAM cards nicely, leaving room for a reasonable context window.
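For anyone reproducing this setup, here is a minimal sketch of loading such a GGUF with llama-cpp-python; the file name, context size, and offload setting are placeholders I chose, not values from this repo:

from llama_cpp import Llama

# Placeholder file name; point this at the actual downloaded q2ks GGUF.
llm = Llama(
    model_path="./Qwen3-30B-A3B-Instruct-2507-q2ks-mixed.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=8192,       # assumed context size, leaving VRAM headroom for the KV cache
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the benefits of MoE models in two sentences."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])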

I was not able to find a reasoning version of Qwen3-30B-Instruct-2507 with the same q2ks AutoRound quantization. I tried quantizing on my local PC, but 128 GB RAM was not enough even for Qwen3 4B. Could you release the Thinking model as a q2ks GGUF? Currently, GPT-OSS 20B MXFP4 is the leader for GPUs with <= 16 GB of VRAM.

Intel org

Qwen3-30B-Instruct-2507 is not a MoE model, so a larger loss with 2-bit precision is expected. We'll look into resolving the RAM issue once we're back in the office. Thanks!

Thanks. I'm asking you to generate Qwen3-30B-A3B-Thinking-2507-gguf-q2ks-mixed-AutoRound.
By the way, why would the Instruct model not be MoE? According to the official documentation, both the Instruct and Thinking models are MoE with 3B active parameters (128 experts, 8 active).

Thank you so much!

Please do the same with the 80b version !!

Intel org

Please do the same with the 80b version !!

@groxaxo Do you mean Qwen/Qwen3-Next-80B-A3B-Instruct? The Qwen3-Next series is not yet supported by llama.cpp. Once support is available, we will try to provide a quantized model.

Intel org

@anjeysapkovski Regarding "128 GB RAM was not enough even for Qwen3 4B": I haven't been able to reproduce this issue yet. Could you provide more information, such as the operating environment, the running log, etc.? We will try to reproduce and fix it.

@n1ck-guo , Windows 11, 128 GB RAM, 16 GB VRAM. I don't remember the exact config, but it was something like this:

from auto_round import AutoRound

model_name_or_path = "./Qwen3-4B-Thinking-2507"

# iters=0 means no tuning iterations (plain RTN-style rounding), W4A16 scheme
ar = AutoRound(
    model=model_name_or_path,
    scheme="W4A16",
    iters=0,
)

output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="gguf:q4_k_s")

Task Manager showed memory consumption climbing past 100 GB before an OOM crash. I tried low-memory flags with no success. It would be nice if the tool estimated and reported the required amount of memory up front, or could run within a constrained-resource environment.
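For context, a rough back-of-the-envelope sketch (my own estimate, not anything AutoRound reports) of the weight footprint alone shows how far the observed 100+ GB is from the size of the model itself:

# Rough estimate of RAM needed just to hold the weights in bf16/fp16
# (~2 bytes per parameter); working buffers and calibration data come on top.
def rough_weight_ram_gib(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1024**3

print(f"Qwen3-4B weights (bf16): ~{rough_weight_ram_gib(4e9):.1f} GiB")
print(f"Qwen3-30B-A3B weights (bf16): ~{rough_weight_ram_gib(30e9):.1f} GiB")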

Intel org

This issue is difficult to reproduce. However, we have refined the logic, and in the main branch, quantizing an 8B model to GGUF with iters=0 typically requires 12–16 GB of RAM and 8 GB of VRAM.
We have also added a memory monitor in the main branch. You can try again, and feel free to open an issue in the AutoRound repository if you still encounter any problems.
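In the meantime, here is a minimal sketch of a peak-RSS tracker using psutil that can help confirm local memory usage; note this is only an illustration, not the monitor that was added to the main branch:

import threading
import time

import psutil

class PeakRSSMonitor:
    """Background thread that records the peak RSS of the current process."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self.peak_bytes = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def _poll(self):
        proc = psutil.Process()
        while not self._stop.is_set():
            self.peak_bytes = max(self.peak_bytes, proc.memory_info().rss)
            time.sleep(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        print(f"Peak RSS: {self.peak_bytes / 1024**3:.1f} GiB")

# Example: wrap the quantization call from the earlier snippet.
# with PeakRSSMonitor():
#     ar.quantize_and_save("./tmp_autoround", format="gguf:q4_k_s")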
