DeepSeek V4 Flash dsv4_int INT4/INT8

This checkpoint is experimental and under active development. It is intended for the AppMana Ampere vLLM fork and is not a general-purpose Hugging Face Transformers checkpoint.

Generated on 2026-05-11 from the clean source checkpoint:

deepseek-ai/DeepSeek-V4-Flash@fd53f944496234770ba80e15004f9b6d269a71f5
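To reproduce the conversion input, the source snapshot can be pinned to that exact revision. A minimal sketch using huggingface_hub; the returned path is what the conversion command below receives as --src:

  from huggingface_hub import snapshot_download

  # Fetch the exact source revision used for this conversion.
  src = snapshot_download(
      repo_id="deepseek-ai/DeepSeek-V4-Flash",
      revision="fd53f944496234770ba80e15004f9b6d269a71f5",
  )
  print(src)  # local snapshot directory, passed as --src to the conversion script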

Conversion command:

CUDA_VISIBLE_DEVICES=1 python tools/ampere/dsv4_requant_checkpoint.py \
  --src /home/administrator/inference/.cache/huggingface/models--deepseek-ai--DeepSeek-V4-Flash/snapshots/fd53f944496234770ba80e15004f9b6d269a71f5 \
  --dst /home/administrator/inference/deepseek-v4-flash-dsv4-int-channel-vllm \
  --device cuda:0 \
  --dense-int8-strategy channel \
  --overwrite

Quantization format:

  • Routed MoE experts: MXFP4 source weights are converted to packed symmetric INT4 W4A16 (group size 32) for the Marlin MoE kernel.
  • Dense FP8 linears: converted to a channelwise biased UINT8 W8A16 format for the Ampere AllSpark path where supported (a rough numeric sketch of both formats follows this list).
  • Preserved precision: embeddings, norms, gates, attention sinks, HC tensors, and any other tensors marked BF16/F32 in the source are kept as-is.
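The dsv4_requant_checkpoint.py script is the source of truth for the exact numerics and on-disk layout. As an illustration only, the two weight formats correspond roughly to the PyTorch sketch below; the function names, the nibble packing order, and operating on an already-dequantized weight are assumptions (Marlin in particular uses its own interleaved packing produced by the real script):

  import torch

  def quant_int4_groupwise(w: torch.Tensor, group_size: int = 32):
      # Symmetric INT4, group size 32: one scale per group of 32 input
      # features, integer values in [-8, 7], two nibbles packed per byte.
      out_features, in_features = w.shape
      g = w.reshape(out_features, in_features // group_size, group_size)
      scale = (g.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
      q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)
      q = q.reshape(out_features, in_features)
      nibbles = (q & 0xF).to(torch.uint8)              # two's-complement nibbles, 0..15
      packed = nibbles[:, 0::2] | (nibbles[:, 1::2] << 4)
      return packed, scale.reshape(out_features, -1)

  def quant_uint8_channelwise(w: torch.Tensor):
      # Channelwise biased (asymmetric) UINT8: one scale and zero-point per
      # output channel; activations stay in 16-bit, hence W8A16.
      w_min = w.amin(dim=1, keepdim=True)
      w_max = w.amax(dim=1, keepdim=True)
      scale = ((w_max - w_min) / 255.0).clamp(min=1e-8)
      zero_point = torch.clamp(torch.round(-w_min / scale), 0, 255)
      q = torch.clamp(torch.round(w / scale) + zero_point, 0, 255).to(torch.uint8)
      return q, scale, zero_point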

Structural audit (a verification sketch follows this list):

  • Safetensor shards: 46
  • Size: about 157 GiB
  • Expert INT4 tensors: 33,792
  • Dense INT8 tensors: 375
  • Preserved tensors: 853
  • Missing expert scales: 0
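The counts above were produced by the conversion tooling. A minimal sketch of reproducing a dtype-level census from the shards, assuming a standard safetensors layout (the real audit also checks tensor names and expert scale pairing):

  import glob
  from collections import Counter
  from safetensors import safe_open

  counts = Counter()
  for shard in sorted(glob.glob("deepseek-v4-flash-dsv4-int-channel-vllm/*.safetensors")):
      with safe_open(shard, framework="pt", device="cpu") as f:
          for name in f.keys():
              # get_tensor reads the data; acceptable for a one-off audit pass
              counts[str(f.get_tensor(name).dtype)] += 1
  # Expect uint8/int8 quantized weights plus BF16/F32/I64 preserved tensors.
  print(dict(counts))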

Known status:

  • The 2-layer version of this conversion path loads and generates locally under vLLM with compile and CUDA graphs enabled (see the smoke-test sketch after this list).
  • The full 43-layer checkpoint has been converted and structurally audited, but it must still pass a full distributed vLLM load/generation test before it can be treated as usable.
  • Quality/perplexity is not yet validated. Do not assume this checkpoint matches the original FP4/FP8 DeepSeek checkpoint until evaluation confirms it.
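A minimal local smoke-test sketch, assuming the AppMana Ampere fork keeps the upstream vLLM Python API; the engine arguments are illustrative and the fork may require additional flags for the Marlin MoE and AllSpark paths:

  from vllm import LLM, SamplingParams

  llm = LLM(
      model="appmana/deepseek-v4-int4-int8",  # or the local --dst directory
      tensor_parallel_size=8,                 # illustrative; match your GPU count
      trust_remote_code=True,
  )
  outputs = llm.generate(
      ["Briefly explain what a mixture-of-experts layer does."],
      SamplingParams(temperature=0.0, max_tokens=64),
  )
  print(outputs[0].outputs[0].text)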