2507 Thinking model release
Dear team!
The current AutoRound q2ks build of Qwen3-30B-A3B-Instruct-2507 is amazingly fast and stable on consumer GPUs such as the RTX 3060 12 GB and RTX 5060 16 GB, giving around 110 t/s output and reasonable results on multilingual tasks.
It fits 12-16 GB VRAM cards nicely, leaving room for a reasonable context.
I was not able to find the reasoning version of Qwen3-30B-Instruct-2507 with the same q2ks AutoRound quantization. I tried quantizing it on my local PC, but 128 GB of RAM was not enough even for Qwen3 4B. Could you release the thinking model as a q2ks GGUF? Right now GPT-OSS 20B MXFP4 is the leader for GPUs with <= 16 GB of VRAM.
Qwen3-30B-Instruct-2507 is not a MoE model, so a larger loss with 2-bit precision is expected. We'll look into resolving the RAM issue once we're back in the office. Thanks!
Thanks. I'm asking you to generate Qwen3-30B-A3B-Thinking-2507-gguf-q2ks-mixed-AutoRound.
By the way, why do you say the Instruct model is not MoE? According to the official documentation, both the Instruct and Thinking models are MoE with 3B active parameters (128 experts, 8 active).
Thank you so much!
Please do the same with the 80B version!
@anjeysapkovski Regarding "128 GB RAM was not enough even for Qwen3 4B": I haven't been able to reproduce this issue yet. Could you provide more information, for example the operating environment, the running log, etc.? We will try to reproduce and fix it.
@n1ck-guo, Windows 11, 128 GB RAM, 16 GB VRAM. I don't remember the exact config, but it was something like this:
from auto_round import AutoRound

# Local checkpoint of the 4B thinking model
model_name_or_path = "./Qwen3-4B-Thinking-2507"

ar = AutoRound(
    model=model_name_or_path,
    scheme="W4A16",
    iters=0,  # no tuning iterations
)

output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="gguf:q4_k_s")
Task manager showed memory consumption climbing past 100 GB, followed by an OOM crash. I tried the low-memory flags with no success. It would be nice if the tool estimated and reported the required amount of memory up front, or allowed the algorithm to run in a resource-constrained environment.
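Even a rough pre-check would help. Something like the sketch below is what I have in mind (my own illustration, not an existing AutoRound feature; the 3x factor is just a guess and psutil is assumed to be installed):

import os

import psutil  # assumed installed; used only for the free-RAM check

model_dir = "./Qwen3-4B-Thinking-2507"  # same local checkpoint as above

# Total size of the weight files on disk.
weight_bytes = sum(
    os.path.getsize(os.path.join(model_dir, f))
    for f in os.listdir(model_dir)
    if f.endswith((".safetensors", ".bin"))
)

# Assumed factor: the weights plus working copies during quantization.
# 3x is only a guess for illustration, not a measured AutoRound requirement.
estimated_need = 3 * weight_bytes
available = psutil.virtual_memory().available

print(f"weights on disk: {weight_bytes / 2**30:.1f} GiB")
print(f"rough estimate:  {estimated_need / 2**30:.1f} GiB needed")
print(f"available RAM:   {available / 2**30:.1f} GiB")
if estimated_need > available:
    print("warning: quantization may run out of memory")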
This issue is difficult to reproduce. However, we have refined the logic, and in the main branch, quantizing an 8B model to GGUF with iters=0 typically requires 12–16 GB of RAM and 8 GB of VRAM.
We have also added a memory monitor in the main branch. You can try again, and feel free to open an issue in the AutoRound repository if you still encounter any problems.
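If you want an independent reading of the actual peak while retrying, a simple process-level watcher like the sketch below also works (it uses psutil and is separate from AutoRound's built-in monitor; the sampling interval is arbitrary):

import threading
import time

import psutil  # assumed installed


def watch_peak_rss(stop_event, result, interval=0.5):
    # Sample this process's resident set size until asked to stop,
    # and record the peak value in GiB.
    proc = psutil.Process()
    peak = 0
    while not stop_event.is_set():
        peak = max(peak, proc.memory_info().rss)
        time.sleep(interval)
    result["peak_rss_gib"] = peak / 2**30


stop = threading.Event()
result = {}
watcher = threading.Thread(target=watch_peak_rss, args=(stop, result))
watcher.start()
try:
    # Run the quantization here, e.g. ar.quantize_and_save(...) from the script above.
    pass
finally:
    stop.set()
    watcher.join()
    print(f"peak RSS during the run: {result.get('peak_rss_gib', 0):.1f} GiB")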