DeepSeek V4 Flash dsv4_int INT4/INT8

This checkpoint is experimental and under active development. It is intended for the AppMana Ampere vLLM fork and is not a general-purpose Hugging Face Transformers checkpoint.

Generated on 2026-05-11 from the clean source checkpoint:

deepseek-ai/DeepSeek-V4-Flash@fd53f944496234770ba80e15004f9b6d269a71f5
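To reproduce the conversion input, the source snapshot can be pinned to that exact revision. A minimal sketch using huggingface_hub; the returned path is what the conversion command below receives as --src:

  from huggingface_hub import snapshot_download

  # Fetch the exact source revision used for this conversion.
  src = snapshot_download(
      repo_id="deepseek-ai/DeepSeek-V4-Flash",
      revision="fd53f944496234770ba80e15004f9b6d269a71f5",
  )
  print(src)  # local snapshot directory, passed as --src to the conversion script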

Conversion command:

CUDA_VISIBLE_DEVICES=1 python tools/ampere/dsv4_requant_checkpoint.py \
  --src /home/administrator/inference/.cache/huggingface/models--deepseek-ai--DeepSeek-V4-Flash/snapshots/fd53f944496234770ba80e15004f9b6d269a71f5 \
  --dst /home/administrator/inference/deepseek-v4-flash-dsv4-int-channel-vllm \
  --device cuda:0 \
  --dense-int8-strategy channel \
  --overwrite

Quantization format:

  • Routed MoE experts: MXFP4 source weights are converted to packed symmetric INT4 W4A16 (group size 32) for the Marlin MoE kernel.
  • Dense FP8 linears: converted to a channelwise biased UINT8 W8A16 format for the Ampere AllSpark path where supported (a rough numeric sketch of both formats follows this list).
  • Preserved precision: embeddings, norms, gates, attention sinks, HC tensors, and any other tensors marked BF16/F32 in the source are kept as-is.
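The dsv4_requant_checkpoint.py script is the source of truth for the exact numerics and on-disk layout. As an illustration only, the two weight formats correspond roughly to the PyTorch sketch below; the function names, the nibble packing order, and operating on an already-dequantized weight are assumptions (Marlin in particular uses its own interleaved packing produced by the real script):

  import torch

  def quant_int4_groupwise(w: torch.Tensor, group_size: int = 32):
      # Symmetric INT4, group size 32: one scale per group of 32 input
      # features, integer values in [-8, 7], two nibbles packed per byte.
      out_features, in_features = w.shape
      g = w.reshape(out_features, in_features // group_size, group_size)
      scale = (g.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
      q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)
      q = q.reshape(out_features, in_features)
      nibbles = (q & 0xF).to(torch.uint8)              # two's-complement nibbles, 0..15
      packed = nibbles[:, 0::2] | (nibbles[:, 1::2] << 4)
      return packed, scale.reshape(out_features, -1)

  def quant_uint8_channelwise(w: torch.Tensor):
      # Channelwise biased (asymmetric) UINT8: one scale and zero-point per
      # output channel; activations stay in 16-bit, hence W8A16.
      w_min = w.amin(dim=1, keepdim=True)
      w_max = w.amax(dim=1, keepdim=True)
      scale = ((w_max - w_min) / 255.0).clamp(min=1e-8)
      zero_point = torch.clamp(torch.round(-w_min / scale), 0, 255)
      q = torch.clamp(torch.round(w / scale) + zero_point, 0, 255).to(torch.uint8)
      return q, scale, zero_point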

Structural audit (a verification sketch follows this list):

  • Safetensor shards: 46
  • Size: about 157 GiB
  • Expert INT4 tensors: 33,792
  • Dense INT8 tensors: 375
  • Preserved tensors: 853
  • Missing expert scales: 0
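The counts above were produced by the conversion tooling. A minimal sketch of reproducing a dtype-level census from the shards, assuming a standard safetensors layout (the real audit also checks tensor names and expert scale pairing):

  import glob
  from collections import Counter
  from safetensors import safe_open

  counts = Counter()
  for shard in sorted(glob.glob("deepseek-v4-flash-dsv4-int-channel-vllm/*.safetensors")):
      with safe_open(shard, framework="pt", device="cpu") as f:
          for name in f.keys():
              # get_tensor reads the data; acceptable for a one-off audit pass
              counts[str(f.get_tensor(name).dtype)] += 1
  # Expect uint8/int8 quantized weights plus BF16/F32/I64 preserved tensors.
  print(dict(counts))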

Known status:

  • The 2-layer version of this conversion path loads and generates locally under vLLM with compile and CUDA graphs enabled (see the smoke-test sketch after this list).
  • The full 43-layer checkpoint has been converted and structurally audited, but it must still pass a full distributed vLLM load/generation test before it can be treated as usable.
  • Quality/perplexity is not yet validated. Do not assume this checkpoint matches the original FP4/FP8 DeepSeek checkpoint until evaluation confirms it.
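A minimal local smoke-test sketch, assuming the AppMana Ampere fork keeps the upstream vLLM Python API; the engine arguments are illustrative and the fork may require additional flags for the Marlin MoE and AllSpark paths:

  from vllm import LLM, SamplingParams

  llm = LLM(
      model="appmana/deepseek-v4-int4-int8",  # or the local --dst directory
      tensor_parallel_size=8,                 # illustrative; match your GPU count
      trust_remote_code=True,
  )
  outputs = llm.generate(
      ["Briefly explain what a mixture-of-experts layer does."],
      SamplingParams(temperature=0.0, max_tokens=64),
  )
  print(outputs[0].outputs[0].text)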