Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid

TevunahAi Professional Grade · 2,048-Sample Calibration · Ultra Hybrid · vLLM Required

Overview

Enterprise-quality Ultra Hybrid quantization of Qwen3-Next-80B-A3B-Instruct MoE, featuring mixed-precision INT4/INT8/FP8 with 2048-sample calibration (8x industry standard).

| Property | Value |
|---|---|
| Original Size | ~160GB (FP16) |
| Quantized Size | ~43GB |
| Compression | ~73% reduction |
| Quality Retention | 98-99% |
| Active Parameters | ~3B per token |
| Inference Runtime | vLLM (required) |

⚠️ Note: This model uses compressed-tensors format and requires vLLM for inference. It is not compatible with standard Transformers inference.

Model Architecture

Qwen3-Next-80B-A3B is a state-of-the-art Mixture-of-Experts model featuring:

  • 80B total parameters, ~3B active per token
  • 48 layers with hybrid attention (Gated DeltaNet + Gated Attention)
  • 512 experts, 10 active + 1 shared per token
  • Native 262K context, expandable to 1M+
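
These figures can be cross-checked against the base model's configuration. A quick inspection sketch (printing the whole config is the safe option, since field names vary by architecture):

```python
# Inspect the base model's architecture parameters without downloading weights.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",
    trust_remote_code=True,  # custom Qwen3NextForCausalLM architecture
)
print(cfg)  # layer count, expert count, native context length, etc.
```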

Quantization Strategy

| Component | Precision | Rationale |
|---|---|---|
| Embeddings | FP16 | Vocabulary precision preserved |
| First 2 attention layers | FP8 | Foundation layers need precision |
| Middle attention (layers 2-45) | W8A8 | Balanced performance/size |
| Last 2 attention layers | FP8 | Output precision critical |
| All MoE experts (512 per layer) | W4A16 | Massive compression, minimal loss |
| MoE gate/router | FP16 | Critical for routing accuracy |
| Shared expert | W4A16 | Consistent with routed experts |
| LM head | FP16 | Output quality preserved |
| LayerNorms | FP16 | Normalization preserved |

Why Ultra Hybrid Works for MoE

  1. Expert layers dominate: 512 experts × 48 layers account for the vast majority of parameters.
  2. INT4 experts: massive compression where the quality impact is smallest.
  3. FP16 routing: keeping the gate/router at full precision ensures correct expert selection.
  4. Precision where it matters: FP8/INT8 attention preserves reasoning quality.
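
The assignment above amounts to a simple mapping from module name and layer index to a scheme. The sketch below is purely illustrative; the module-name patterns are assumptions based on common Qwen-style MoE naming, not the actual recipe used for this checkpoint:

```python
# Illustrative sketch of the Ultra Hybrid precision assignment.
# Module-name patterns are assumptions (typical Qwen-style MoE naming),
# not the exact matching rules used to build this checkpoint.
NUM_LAYERS = 48

def scheme_for(name: str, layer: int) -> str:
    """Map a module name (and its layer index) to a quantization scheme."""
    if "embed" in name or "lm_head" in name or "norm" in name:
        return "FP16"   # embeddings, LM head, LayerNorms stay full precision
    if "gate" in name and "expert" not in name:
        return "FP16"   # MoE router kept in FP16 for routing accuracy
    if "expert" in name:
        return "W4A16"  # all experts (routed and shared) in INT4
    # remaining modules are attention projections
    if layer < 2 or layer >= NUM_LAYERS - 2:
        return "FP8"    # first/last two layers kept at higher precision
    return "W8A8"       # middle attention layers in INT8

print(scheme_for("model.layers.0.self_attn.q_proj", 0))         # FP8
print(scheme_for("model.layers.10.mlp.experts.3.up_proj", 10))  # W4A16
```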

Quantization Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Next-80B-A3B-Instruct |
| Architecture | Qwen3NextForCausalLM (MoE) |
| Method | Ultra Hybrid (W4A16 + W8A8 + FP8) |
| Total Layers | 48 |
| Experts per Layer | 512 |
| Active Experts | 10 routed + 1 shared |
| Calibration Samples | 2,048 (Professional Grade) |
| Calibration Seq Length | 2,048 tokens |
| Quantization Time | 402.4 minutes (~6.7 hours) |
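
The exact multi-scheme recipe is not published with this card. For orientation only, here is a minimal llm-compressor sketch covering just the W4A16 expert portion; the FP8/W8A8 attention groups from the strategy table are omitted, and the ignore patterns are assumptions:

```python
# Simplified sketch: W4A16 (GPTQ) quantization of expert weights only.
# NOT the full Ultra Hybrid recipe: the FP8 / W8A8 attention groups and
# the per-layer exceptions from the strategy table are omitted here.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    # Keep the LM head and MoE routers unquantized; the regex is an assumed pattern.
    ignore=["lm_head", "re:.*mlp.gate$"],
)

oneshot(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    dataset="open_platypus",       # stand-in; the card used a 4-dataset mix
    recipe=recipe,
    max_seq_length=2048,           # matches the card's calibration seq length
    num_calibration_samples=2048,  # matches the card's sample count
)
```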

Calibration Datasets

| Dataset | Samples | Purpose |
|---|---|---|
| CodeAlpaca-20k | 512 | Code instruction following |
| OpenHermes-2.5 | 512 | Diverse general instructions |
| Open-Platypus | 512 | STEM and reasoning |
| UltraChat-200k | 512 | Multi-turn conversation |
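
A mix like this is straightforward to assemble with the Hugging Face datasets library. The sketch below is a plausible reconstruction, not the card's actual preprocessing; the Hub IDs are the commonly used ones for these datasets, and the text formatting is a crude placeholder:

```python
# Sketch: building a 2,048-sample mixed calibration set (512 per source).
from datasets import Dataset, concatenate_datasets, load_dataset

def take(repo_id: str, split: str, n: int = 512) -> Dataset:
    """Shuffle deterministically and take the first n samples."""
    return load_dataset(repo_id, split=split).shuffle(seed=42).select(range(n))

parts = [
    take("sahil2801/CodeAlpaca-20k", "train"),
    take("teknium/OpenHermes-2.5", "train"),
    take("garage-bAInd/Open-Platypus", "train"),
    take("HuggingFaceH4/ultrachat_200k", "train_sft"),
]

# Each source has a different schema (instruction/output, conversations,
# messages), so map every subset to a single "text" column before concatenating.
def to_text(ds: Dataset) -> Dataset:
    cols = ds.column_names
    def fmt(ex):
        return {"text": str({k: ex[k] for k in cols})}  # crude placeholder formatting
    return ds.map(fmt, remove_columns=cols)

calibration = concatenate_datasets([to_text(p) for p in parts])
assert len(calibration) == 2048
```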

Hardware Used

| Component | Specification |
|---|---|
| CPU | Dual Intel Xeon Max 9480 (224 threads) |
| Memory | 128GB HBM2e + 256GB DDR5 |
| GPU | NVIDIA RTX 5000 Ada (32GB) |
| Optimizations | Intel AMX, TF32 |

Performance Comparison

| Version | Size | Quality | Speed |
|---|---|---|---|
| FP16 | ~160GB | 100% | Baseline |
| FP8 | ~80GB | 98-99% | ~1.2x |
| INT8 | ~80GB | 97-98% | ~1.3x |
| INT4 | ~40GB | 95-97% | ~1.5x |
| Ultra Hybrid | ~43GB | 98-99% | ~1.4x |
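
As a quick sanity check on the size column (back-of-envelope arithmetic using the card's approximate figures):

```python
# Back-of-envelope check of the Ultra Hybrid size figures from the table above.
total_params = 80e9     # 80B parameters
quantized_bytes = 43e9  # ~43GB checkpoint (1 GB treated as 1e9 bytes)

bits_per_param = quantized_bytes * 8 / total_params
print(f"~{bits_per_param:.1f} bits/param")    # ~4.3: mostly INT4 experts, plus FP8/INT8/FP16 islands

reduction = 1 - 43 / 160
print(f"~{reduction:.0%} smaller than FP16")  # ~73%, matching the Overview table
```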

Usage (vLLM Required)

This model requires vLLM for inference. Install with:

```bash
pip install vllm
```

Basic Generation

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
    trust_remote_code=True,
    tensor_parallel_size=1,  # increase for multi-GPU
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
    top_p=0.9,
)

outputs = llm.generate(
    ["Explain quantum computing in simple terms:"],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```

Chat Interface

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
    trust_remote_code=True,
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms:"},
]

# llm.chat applies the model's chat template automatically.
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```

Multi-GPU Inference

```python
from vllm import LLM, SamplingParams

# Use tensor parallelism to shard the model across GPUs.
llm = LLM(
    model="TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
    trust_remote_code=True,
    tensor_parallel_size=2,  # 2 GPUs
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)
```

OpenAI-Compatible Server

```bash
# Start the vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid \
    --trust-remote-code \
    --port 8000
```

```bash
# Query the completions endpoint with curl
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
        "prompt": "Explain quantum computing:",
        "max_tokens": 256
    }'
```
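
The server can also be queried with the official openai Python client; a minimal sketch (vLLM ignores the API key by default, so any placeholder works):

```python
# Query the vLLM OpenAI-compatible server with the openai Python client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key unused by default

response = client.completions.create(
    model="TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
    prompt="Explain quantum computing:",
    max_tokens=256,
)
print(response.choices[0].text)
```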

Hardware Requirements

| Setup | VRAM | Notes |
|---|---|---|
| Minimum | 48GB | Single A6000/L40S |
| Recommended | 80GB | Single A100/H100 |
| Multi-GPU | 2x 24GB | RTX 3090/4090 with tensor_parallel_size=2 |
| Optimal | 2x 48GB+ | Full speed, tensor parallel |
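
On the tighter single-GPU setups, capping the context length leaves more headroom for weights and KV cache. A hedged example using standard vLLM constructor arguments (the specific values are guesses for a 48GB card, not tested settings):

```python
from vllm import LLM

# Sketch: fitting the ~43GB checkpoint on a single 48GB GPU by trading
# context length for weight/KV-cache headroom. Values are illustrative.
llm = LLM(
    model="TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid",
    trust_remote_code=True,
    max_model_len=32768,          # far below the native 262K context
    gpu_memory_utilization=0.95,  # default is 0.90
)
```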

TevunahAi Professional Standard

What sets TevunahAi quantizations apart:

| Aspect | Industry Standard | TevunahAi Professional |
|---|---|---|
| Calibration Samples | 256 | 2,048 (8x more) |
| Dataset Diversity | Single dataset | 4 diverse datasets |
| Hardware | Consumer GPU | Enterprise dual-socket + HBM |
| Quality Focus | Speed | Production-grade accuracy |

Why 2048 Samples Matter

  • Broader activation coverage: captures value ranges across code, chat, reasoning, and STEM inputs
  • Reduced outlier sensitivity: more samples yield more stable quantization ranges
  • Better edge-case handling: rare tokens and patterns are properly calibrated
  • Production reliability: suitable for enterprise deployment
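
The stability point is easy to illustrate with a toy simulation (synthetic heavy-tailed data, not measurements from this model): estimate a percentile-based clipping range at two calibration sizes and compare how much the estimate varies across repeated draws.

```python
# Toy illustration: larger calibration sets give more stable quantization ranges.
# Simulated heavy-tailed values stand in for real model activations.
import numpy as np

rng = np.random.default_rng(0)

def clip_range_estimates(n_samples: int, trials: int = 500) -> np.ndarray:
    """99.9th-percentile |activation| estimate across repeated calibration draws."""
    draws = rng.standard_t(df=4, size=(trials, n_samples))  # heavy tails mimic outliers
    return np.percentile(np.abs(draws), 99.9, axis=1)

for n in (256, 2048):
    est = clip_range_estimates(n)
    print(f"n={n:5d}: mean clip {est.mean():.2f}, std {est.std():.2f}")
# With 2048 samples the clip-range estimate varies far less between runs,
# so the resulting quantization grid is more reproducible.
```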

License

This model inherits the Apache 2.0 License from Qwen/Alibaba.

Citation

```bibtex
@misc{tevunahai2025qwen3next80b,
  title={Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid: Professional-Grade MoE Quantization},
  author={TevunahAi},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/TevunahAi/Qwen3-Next-80B-A3B-Instruct-Ultra-Hybrid}
}
```

Quantized with care by TevunahAi using enterprise-grade hardware and calibration standards
