Qwen3.6-27B DFlash Draft — GGUF

GGUF quantizations of the z-lab/Qwen3.6-27B-DFlash draft model, produced for the Lucebox dflash engine (speculative decoding for Qwen3.6-27B-Q4_K_M).

Source: deepsweet/Qwen3.6-27B-DFlash-FP16 (FP16 safetensors mirror of z-lab's BF16 weights)
Default: dflash-draft-3.6-q4_k_m.gguf (1.06 GB), faster/lower-memory draft used by current Lucebox quickstarts
Q8_0: dflash-draft-3.6-q8_0.gguf (1.84 GB), kept for conservative parity checks
Arch: qwen35-dflash-draft, 5 layers, hidden 5120, n_target_layers 5, vocab 248320
Tensors: projection weights quantized; norms kept in F32 (precision-critical and tiny)
Block size: 16, RoPE θ 1e6, RMS ε 1e-6, MASK token id 248070

Files

| File | Size | Purpose |
|---|---|---|
| `dflash-draft-3.6-q4_k_m.gguf` | 1.06 GB | Default/recommended draft model. Pass to dflash via `--draft` |
| `dflash-draft-3.6-q8_0.gguf` | 1.84 GB | Higher-precision draft for parity/debug checks |
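
Every GGUF file begins with the four ASCII bytes `GGUF`, so a download can be sanity-checked before it is handed to the engine. The sketch below is only a quick guard against truncated or mislabeled files, not a full validation. The helper name `is_gguf` is hypothetical, and the `models/draft` paths assume the directory layout used in the usage steps of this card.

```shell
# Succeed if the file exists and starts with the 4-byte GGUF magic.
# tr strips NUL bytes so non-GGUF binary files do not trigger shell warnings.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1" | tr -d '\0')" = "GGUF" ]
}

# Check whichever quantizations have been downloaded (paths are assumptions).
for f in models/draft/dflash-draft-3.6-q4_k_m.gguf \
         models/draft/dflash-draft-3.6-q8_0.gguf; do
  if is_gguf "$f"; then
    echo "ok: $f"
  else
    echo "missing or not GGUF: $f"
  fi
done
```

A file that fails this check is usually an interrupted download or an HTML error page saved under a `.gguf` name. Downloading it again is the simplest fix.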

Usage with the Lucebox dflash engine

# 1. Clone + checkout (PR 129 adds Qwen3.6 SWA support)
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/129/head:pr129 && git checkout pr129
git submodule update --init --recursive

# 2. Build (sm_86+ enables Block-Sparse Attention; sm_75 falls back to ggml flash_attn_ext)
cd dflash
cmake -B build -S . -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DDFLASH27B_ENABLE_BSA=ON \
  -DDFLASH27B_TESTS=ON
cmake --build build --target test_dflash -j

# 3. Get the target (Q4_K_M GGUF) and this draft
mkdir -p models/target models/draft
hf download unsloth/Qwen3.6-27B-GGUF --include "*Q4_K_M*.gguf" --local-dir models/target
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF --include "dflash-draft-3.6-q4_k_m.gguf" --local-dir models/draft

# 4. Run
export DFLASH_TARGET=models/target/Qwen3.6-27B-Q4_K_M.gguf
export DFLASH_DRAFT=models/draft/dflash-draft-3.6-q4_k_m.gguf
echo "Write a haiku about GPUs." | python3 scripts/run.py --max-ctx 2048 --n-gen 256

The binary autodetects .gguf vs .safetensors from the draft path.
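
Step 2 above hardcodes `-DCMAKE_CUDA_ARCHITECTURES=86`. On other GPUs the value can be derived from the compute capability that `nvidia-smi` reports, as in the hedged sketch below. The helper names `cuda_arch_from_cap` and `bsa_supported` are hypothetical, the `compute_cap` query field needs a reasonably recent NVIDIA driver, and the 86 threshold comes from the build comment in step 2. This card does not say whether `-DDFLASH27B_ENABLE_BSA` has to be switched off by hand on older GPUs, so the sketch only reports what it finds.

```shell
# Map a compute capability string such as "8.6" to the form expected by
# -DCMAKE_CUDA_ARCHITECTURES (e.g. "86").
cuda_arch_from_cap() {
  printf '%s\n' "$1" | tr -d '.'
}

# Per the build comment in step 2, sm_86 and newer can use Block-Sparse Attention.
bsa_supported() {
  [ "$(cuda_arch_from_cap "$1")" -ge 86 ]
}

# Query the first GPU. Fall back to 8.6 (the value used in step 2) when
# nvidia-smi is missing or does not return something shaped like "8.6".
cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)"
case "$cap" in
  [0-9]*.[0-9]*) ;;
  *) cap="8.6" ;;
esac

echo "CMAKE_CUDA_ARCHITECTURES=$(cuda_arch_from_cap "$cap")"
if bsa_supported "$cap"; then
  echo "Block-Sparse Attention: available"
else
  echo "Block-Sparse Attention: not available (ggml flash_attn_ext fallback)"
fi
```

The printed architecture value can then replace `86` in the `cmake` invocation from step 2.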

Compatibility

Target: any Qwen3.6-27B-Q4_K_M.gguf (e.g. unsloth/Qwen3.6-27B-GGUF)
Draft: the DFlash architecture (5 layers plus dflash.fc.weight and dflash.hidden_norm.weight) is loaded by gguf_draft_loader.cpp. Re-quantizing this draft requires the matching Lucebox GGUF tooling. Do not use stock llama-quantize, which won't preserve the dflash-specific tensors.

License & attribution

Apache 2.0, inheriting the upstream z-lab license. Original DFlash work and weights by z-lab; FP16 mirror by deepsweet; GGUF quantization + repackaging by Lucebox.
