2507 Thinking model release

#4 opened by anjeysapkovski

Dear team!

The current AutoRound q2ks version of Qwen3-30B-A3B-Instruct-2507 is amazingly fast and stable on consumer GPUs like the RTX 3060 12 GB and RTX 5060 16 GB, with around 110 t/s output and reasonable results on multilingual tasks.

It fits 12-16 GB VRAM cards nicely, leaving room for a reasonable context window.
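For anyone reproducing this setup, here is a minimal sketch of loading such a GGUF with llama-cpp-python; the file name, context size, and offload setting are placeholders I chose, not values from this repo:

from llama_cpp import Llama

# Placeholder file name; point this at the actual downloaded q2ks GGUF.
llm = Llama(
    model_path="./Qwen3-30B-A3B-Instruct-2507-q2ks-mixed.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=8192,       # assumed context size, leaving VRAM headroom for the KV cache
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the benefits of MoE models in two sentences."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])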

I was not able to find a reasoning version of Qwen3-30B-Instruct-2507 with the same q2ks AutoRound quantization. I tried quantizing on my local PC, but 128 GB RAM was not enough even for Qwen3 4B. Could you release the Thinking model as a q2ks GGUF? Currently, GPT-OSS 20B MXFP4 is the leader for GPUs with <= 16 GB of VRAM.

Intel org

Qwen3-30B-Instruct-2507 is not a MoE model, so a larger loss with 2-bit precision is expected. We'll look into resolving the RAM issue once we're back in the office. Thanks!

Thanks. I'm asking you to generate Qwen3-30B-A3B-Thinking-2507-gguf-q2ks-mixed-AutoRound.
By the way, why would the Instruct model not be MoE? According to the official documentation, both the Instruct and Thinking models are MoE with 3B active parameters (128 experts, 8 active).

Thank you so much!

Please do the same with the 80b version !!

Intel org

Please do the same with the 80b version !!

@groxaxo Do you mean Qwen/Qwen3-Next-80B-A3B-Instruct? The Qwen3-Next series is not yet supported by llama.cpp. Once support is available, we will try to provide a quantized model.

Intel org

@anjeysapkovski Regarding "128 GB RAM was not enough even for Qwen3 4B": I haven't been able to reproduce this issue yet. Could you provide more information, such as the operating environment, the running log, etc.? We will try to reproduce and fix it.

@n1ck-guo , Windows 11, 128 GB RAM, 16 GB VRAM. I don't remember the exact config, but it was something like this:

from auto_round import AutoRound

model_name_or_path = "./Qwen3-4B-Thinking-2507"

# iters=0 means no tuning iterations (plain RTN-style rounding), W4A16 scheme
ar = AutoRound(
    model=model_name_or_path,
    scheme="W4A16",
    iters=0,
)

output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="gguf:q4_k_s")

Task Manager showed memory consumption climbing past 100 GB before an OOM crash. I tried low-memory flags with no success. It would be nice if the tool estimated and reported the required amount of memory up front, or could run within a constrained-resource environment.
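For context, a rough back-of-the-envelope sketch (my own estimate, not anything AutoRound reports) of the weight footprint alone shows how far the observed 100+ GB is from the size of the model itself:

# Rough estimate of RAM needed just to hold the weights in bf16/fp16
# (~2 bytes per parameter); working buffers and calibration data come on top.
def rough_weight_ram_gib(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1024**3

print(f"Qwen3-4B weights (bf16): ~{rough_weight_ram_gib(4e9):.1f} GiB")
print(f"Qwen3-30B-A3B weights (bf16): ~{rough_weight_ram_gib(30e9):.1f} GiB")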

Intel org

This issue is difficult to reproduce. However, we have refined the logic, and in the main branch, quantizing an 8B model to GGUF with iters=0 typically requires 12–16 GB of RAM and 8 GB of VRAM.
We have also added a memory monitor in the main branch. You can try again, and feel free to open an issue in the AutoRound repository if you still encounter any problems.
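In the meantime, here is a minimal sketch of a peak-RSS tracker using psutil that can help confirm local memory usage; note this is only an illustration, not the monitor that was added to the main branch:

import threading
import time

import psutil

class PeakRSSMonitor:
    """Background thread that records the peak RSS of the current process."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self.peak_bytes = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def _poll(self):
        proc = psutil.Process()
        while not self._stop.is_set():
            self.peak_bytes = max(self.peak_bytes, proc.memory_info().rss)
            time.sleep(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        print(f"Peak RSS: {self.peak_bytes / 1024**3:.1f} GiB")

# Example: wrap the quantization call from the earlier snippet.
# with PeakRSSMonitor():
#     ar.quantize_and_save("./tmp_autoround", format="gguf:q4_k_s")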
