Ultra Quantization Hybrid Model Collection
These models are quantized in mixed precision, giving them a smaller footprint than FP8 while preserving high quality.
Enterprise-quality Ultra Hybrid quantization of the Qwen3-Next-80B-A3B-Instruct MoE model, featuring mixed-precision INT4/INT8/FP8 with 2,048-sample calibration (8x the industry standard).
| Property | Value |
|---|---|
| Original Size | ~160GB (FP16) |
| Quantized Size | ~43GB |
| Compression | ~73% reduction |
| Quality Retention | 98-99% |
| Active Parameters | ~3B per token |
| Inference Runtime | vLLM required |
⚠️ Note: This model uses the compressed-tensors format and requires vLLM for inference. It is not compatible with standard Transformers inference.
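To confirm the serialization format before pulling ~43GB of weights, you can inspect the Hub config. A minimal sketch (the expected `quant_method` value is an assumption based on how compressed-tensors checkpoints are typically tagged):

```python
import json
from huggingface_hub import hf_hub_download

# Download only config.json and check the declared quantization method
cfg_path = hf_hub_download(
    "TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid", "config.json"
)
with open(cfg_path) as f:
    cfg = json.load(f)

# Expected (assumed): "compressed-tensors"
print(cfg.get("quantization_config", {}).get("quant_method"))
```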
Qwen3-Next-80B-A3B is a state-of-the-art Mixture-of-Experts model. The Ultra Hybrid scheme assigns a precision to each component as follows (a hypothetical recipe sketch follows the table):
| Component | Precision | Rationale |
|---|---|---|
| Embeddings | FP16 | Vocabulary precision preserved |
| First 2 Attention Layers | FP8 | Foundation layers need precision |
| Middle Attention (Layers 2-45) | W8A8 | Balanced performance/size |
| Last 2 Attention Layers | FP8 | Output precision critical |
| All MoE Experts (512) | W4A16 | Massive compression, minimal loss |
| MoE Gate/Router | FP16 | Critical for routing accuracy |
| Shared Expert | W4A16 | Consistent with regular experts |
| LM Head | FP16 | Output quality preserved |
| LayerNorms | FP16 | Normalization preserved |
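The exact quantization pipeline for this checkpoint is not published here. As a rough illustration of how a per-component scheme like the one above can be expressed, below is a hypothetical llm-compressor sketch; the module-name regexes, modifier combination, and dataset argument are assumptions, not the actual recipe.

```python
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier, QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)

# Hypothetical mixed-precision recipe mirroring the table above
# (FP8 handling of the first/last attention layers omitted for brevity)
recipe = [
    # W4A16 for expert and shared-expert MLPs; router and lm_head stay high precision
    GPTQModifier(
        targets=["re:.*mlp\\.experts.*", "re:.*mlp\\.shared_expert\\..*"],
        scheme="W4A16",
        ignore=["lm_head", "re:.*mlp\\.gate$"],
    ),
    # 8-bit weights/activations for the attention projections
    QuantizationModifier(
        targets=["re:.*self_attn\\..*_proj$"],
        scheme="W8A8",
        ignore=["lm_head"],
    ),
]

oneshot(
    model=model,
    dataset="open_platypus",        # stand-in; the card mixes four datasets
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=2048,
)
model.save_pretrained("Ultra-Hybrid-out", save_compressed=True)
```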
Full model and quantization specifications:
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Next-80B-A3B-Instruct |
| Architecture | Qwen3NextForCausalLM (MoE) |
| Method | Ultra Hybrid (W4A16 + W8A8 + FP8) |
| Total Layers | 48 |
| Total Experts | 512 |
| Active Experts | 10 + 1 shared |
| Calibration Samples | 2,048 (Professional Grade) |
| Calibration Seq Length | 2,048 tokens |
| Quantization Time | 402.4 minutes (~6.7 hours) |
Calibration data was drawn from four datasets, 512 samples each (a loading sketch follows the table):
| Dataset | Samples | Purpose |
|---|---|---|
| CodeAlpaca-20k | 512 | Code instruction following |
| OpenHermes-2.5 | 512 | Diverse general instructions |
| Open-Platypus | 512 | STEM and reasoning |
| UltraChat-200k | 512 | Multi-turn conversation |
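A sketch of how such a 4 x 512 calibration mix could be assembled with the `datasets` library; the Hub repo IDs, splits, and column names below are common community copies and are assumptions, not necessarily the exact sources used:

```python
from datasets import load_dataset, concatenate_datasets

def take_512(repo_id, split, to_text):
    # Sample 512 examples and flatten each one into a single "text" field
    ds = load_dataset(repo_id, split=split).shuffle(seed=42).select(range(512))
    return ds.map(lambda ex: {"text": to_text(ex)}, remove_columns=ds.column_names)

parts = [
    take_512("sahil2801/CodeAlpaca-20k", "train",          # code instructions
             lambda ex: f"{ex['instruction']}\n{ex['output']}"),
    take_512("teknium/OpenHermes-2.5", "train",            # general instructions
             lambda ex: "\n".join(t["value"] for t in ex["conversations"])),
    take_512("garage-bAInd/Open-Platypus", "train",        # STEM and reasoning
             lambda ex: f"{ex['instruction']}\n{ex['output']}"),
    take_512("HuggingFaceH4/ultrachat_200k", "train_sft",  # multi-turn chat
             lambda ex: "\n".join(m["content"] for m in ex["messages"])),
]

calibration_ds = concatenate_datasets(parts).shuffle(seed=42)  # 2,048 samples total
```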
Quantization was performed on the following hardware:
| Component | Specification |
|---|---|
| CPU | Dual Intel Xeon Max 9480 (224 threads) |
| Memory | 128GB HBM2e + 256GB DDR5 |
| GPU | NVIDIA RTX 5000 Ada (32GB) |
| Optimizations | Intel AMX, TF32 |
Approximate trade-offs across quantization approaches for this model:
| Version | Size | Quality | Speed |
|---|---|---|---|
| FP16 | ~160GB | 100% | Baseline |
| FP8 | ~80GB | 98-99% | ~1.2x |
| INT8 | ~80GB | 97-98% | ~1.3x |
| INT4 | ~40GB | 95-97% | ~1.5x |
| Ultra Hybrid | ~43GB | 98-99% | ~1.4x |
This model requires vLLM for inference. Install with:

```bash
pip install vllm
```
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
    trust_remote_code=True,
    tensor_parallel_size=1,  # Increase for multi-GPU
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
    top_p=0.9,
)

outputs = llm.generate(
    ["Explain quantum computing in simple terms:"],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```
For chat-style prompts, use the built-in chat interface:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
    trust_remote_code=True,
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms:"},
]

outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```
```python
from vllm import LLM, SamplingParams

# Use tensor parallelism for multi-GPU
llm = LLM(
    model="TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
    trust_remote_code=True,
    tensor_parallel_size=2,  # 2 GPUs
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)
```
```bash
# Start the vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid \
    --trust-remote-code \
    --port 8000

# Query the OpenAI-compatible API with curl
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
        "prompt": "Explain quantum computing:",
        "max_tokens": 256
    }'
```
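The same server can also be queried with the `openai` Python client; a minimal sketch (the base URL and placeholder API key follow vLLM's usual OpenAI-compatible defaults):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not check the key, but the client requires one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms:"},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```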
Approximate VRAM requirements for inference (a memory-constrained launch sketch follows the table):
| Setup | VRAM | Notes |
|---|---|---|
| Minimum | 48GB | Single A6000/L40S |
| Recommended | 80GB | Single A100/H100 |
| Multi-GPU | 2x 24GB | RTX 3090/4090 with tensor_parallel_size=2 |
| Optimal | 2x 48GB+ | Full speed, tensor parallel |
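On the tighter single-GPU setups above, capping the context length and raising GPU memory utilization can help the ~43GB of weights plus KV cache fit. A hedged sketch using standard vLLM arguments (the specific values are illustrative, not tuned defaults for this model):

```python
from vllm import LLM

# Illustrative settings for a single 48GB GPU; adjust to your hardware
llm = LLM(
    model="TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,  # fraction of VRAM vLLM may allocate
    max_model_len=8192,           # shorter context reduces KV-cache pressure
)
```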
How this release compares with typical community quantizations:
| Aspect | Industry Standard | TevunahAi Professional |
|---|---|---|
| Calibration Samples | 256 | 2,048 (8x more) |
| Dataset Diversity | Single dataset | 4 diverse datasets |
| Hardware | Consumer GPU | Enterprise dual-socket + HBM |
| Quality Focus | Speed | Production-grade accuracy |
This model inherits the Apache 2.0 License from Qwen/Alibaba.
```bibtex
@misc{tevunahai2025qwen3next80b,
  title={Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid: Professional-Grade MoE Quantization},
  author={TevunahAi},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid}
}
```
Base model: Qwen/Qwen3-Next-80B-A3B-Instruct