GLM-4.5-Air (GPTQModel W8A16 Quantization)
Quantization Details & Hardware Requirements
This is a W8A16 (8-bit weights, 16-bit activations) quantized version of the GLM-4.5-Air model.
Methodology
The quantization was performed using **GPTQModel** with an experimental modification that feeds the entire calibration dataset to each expert, improving quantization quality for the MoE layers.
Calibration Dataset: 2,320 samples in total, drawn from c4/en (1,536), arc (300), gsm8k (300), humaneval (164), and alpaca (20).
Hardware & Performance: This model is verified to run with Tensor Parallel (TP) on 8x NVIDIA RTX 3090 GPUs with a context window of 131,072 tokens.
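As a rough sanity check on why eight 24 GB cards suffice, the weight footprint of W8A16 can be estimated with back-of-envelope arithmetic (a sketch only; real usage also includes activations, the KV cache at long context, and CUDA graph buffers):

```python
# Back-of-envelope VRAM estimate for GLM-4.5-Air under W8A16.
# Assumption: ~106B total parameters stored at 8 bits (1 byte) each.
total_params = 106e9
bytes_per_weight = 1  # 8-bit weights
weight_gb = total_params * bytes_per_weight / 1e9  # ~106 GB of weights

gpus = 8
vram_per_gpu_gb = 24  # RTX 3090
total_vram_gb = gpus * vram_per_gpu_gb  # 192 GB across the TP group

# Under tensor parallelism the weights are sharded across the GPUs,
# leaving per-GPU headroom for activations and the KV cache.
weights_per_gpu_gb = weight_gb / gpus

print(f"{weight_gb:.0f} GB weights, "
      f"{weights_per_gpu_gb:.1f} GB per GPU of {total_vram_gb} GB total")
```

With roughly 13 GB of weights per 24 GB card, the remaining headroom is what makes the 131,072-token context window feasible at `--gpu-memory-utilization 0.9`.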
How to Run (vLLM)
You can serve this model using vLLM. Below is a sample command optimized for an 8x3090 setup:
```shell
export VLLM_ATTENTION_BACKEND="FLASHINFER"
export TORCH_CUDA_ARCH_LIST="8.6"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export VLLM_MARLIN_USE_ATOMIC_ADD=1
export SAFETENSORS_FAST_GPU=1
vllm serve avtc/GLM-4.5-Air-GPTQMODEL-W8A16 \
  -tp 8 \
  --port 8000 \
  --host 0.0.0.0 \
  --uvicorn-log-level info \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 1 \
  --dtype=float16 \
  --seed 1234 \
  --max-model-len 131072 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --enable-sleep-mode \
  --compilation-config '{"level": 3, "cudagraph_capture_sizes": [1]}'
```
Recommended Sampling Parameters:
```json
{
  "top_p": 0.95,
  "temperature": 0.8,
  "repetition_penalty": 1.05,
  "top_k": 40,
  "min_p": 0.05
}
```
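These parameters can be passed directly in an OpenAI-compatible chat completion request to the vLLM server started above (vLLM accepts `top_k`, `min_p`, and `repetition_penalty` as extra sampling parameters in the request body). A minimal sketch using only the standard library; the endpoint and model name match the serve command, and the prompt is illustrative:

```python
import json
import urllib.request

# Recommended sampling parameters from this model card.
SAMPLING = {
    "top_p": 0.95,
    "temperature": 0.8,
    "repetition_penalty": 1.05,  # vLLM extra sampling parameter
    "top_k": 40,                 # vLLM extra sampling parameter
    "min_p": 0.05,               # vLLM extra sampling parameter
}

def build_request(prompt: str) -> dict:
    """Assemble an OpenAI-compatible chat completion body that merges
    the recommended sampling parameters."""
    return {
        "model": "avtc/GLM-4.5-Air-GPTQMODEL-W8A16",
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING,
    }

def post(body: dict) -> dict:
    """POST the body to the local vLLM server started by the command above."""
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage, once the server is up: `post(build_request("Hello"))["choices"][0]["message"]["content"]`.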
Example Output
Prompt:
Make an html animation of fishes in an aquarium. The aquarium is pretty, the fishes vary in colors and sizes and swim realistically. You can left click to place a piece of fish food in aquarium. Each fish chases a food piece closest to it, trying to eat it. Once there are no more food pieces, fishes resume swimming as usual.
Result: The model generated a working artifact matching the prompt.
Acknowledgements
Special thanks to the GPTQModel team for their tools and support in enabling this quantization.
Original Model Introduction
👋 Join the Discord community.
📖 Check out the GLM-4.5 technical blog, technical report, and Zhipu AI technical documentation.
The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.
Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses.
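With the vLLM server above, the mode can be selected per request through the chat template. This is a hedged sketch: it assumes vLLM's `chat_template_kwargs` passthrough and an `enable_thinking` switch in the GLM-4.5 chat template, as described in the GLM-4.5 usage docs; verify against the official documentation before relying on it.

```python
def chat_body(prompt: str, thinking: bool = True) -> dict:
    """Build a chat completion body selecting thinking or non-thinking mode.

    Assumption: the GLM-4.5 chat template exposes an "enable_thinking"
    flag, forwarded by vLLM via "chat_template_kwargs".
    """
    return {
        "model": "avtc/GLM-4.5-Air-GPTQMODEL-W8A16",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

# Non-thinking mode for an immediate response:
fast = chat_body("What is 2 + 2?", thinking=False)
```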
We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development.
As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, placing 3rd among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency.
For more evaluation results, showcases, and technical details, please visit our technical blog or technical report.
The model code, tool parser and reasoning parser can be found in the implementation of transformers, vLLM and SGLang.
Quick Start
Please refer to the GLM-4.5 github page for more details on the original architecture and usage.
Model tree for avtc/GLM-4.5-Air-GPTQMODEL-W8A16
Base model: zai-org/GLM-4.5-Air