neuralmagic/calibration
Viewer • Updated • 20k • 872 • 7
This model is fully compatible with vLLM and optimized to run on a single GPU with 16GB VRAM, achieving ~1.5× faster performance compared to the INT4 version, with potentially better accuracy.
DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 8192
VLLM_SLEEP_WHEN_IDLE=1 vllm serve lhoang8500/Qwen3-VL-8B-Instruct-NVFP4 --max-model-len 32768 -tp 1 --limit_mm_per_prompt '{"image":1, "video":0}' --kv-cache-dtype fp8 --gpu-memory-utilization 0.9 --max-num-seqs 64 --tool-call-parser hermes --enable-auto-tool-choice
greedy='false'
seed=3407
top_p=0.8
top_k=20
temperature=0.7
repetition_penalty=1.0
presence_penalty=1.5
out_seq_length=32768
see file quantize.py, credit to llmcompressor
Base model
Qwen/Qwen3-VL-8B-Instruct