Qwen3.6-27B DFlash Draft - GGUF

GGUF quantizations of the z-lab/Qwen3.6-27B-DFlash draft model, produced for the Lucebox dflash engine (speculative decoding for Qwen3.6-27B-Q4_K_M).

  • Source: deepsweet/Qwen3.6-27B-DFlash-FP16 (FP16 safetensors mirror of z-lab's BF16)
  • Default: dflash-draft-3.6-q4_k_m.gguf (1.06 GB), faster/lower-memory draft used by current Lucebox quickstarts
  • Q8_0: dflash-draft-3.6-q8_0.gguf (1.84 GB), kept for conservative parity checks
  • Arch: qwen35-dflash-draft, 5 layers, hidden 5120, n_target_layers 5, vocab 248320
  • Tensors: projection weights quantized, norms → F32 (precision-critical, tiny)
  • Block size: 16, RoPE θ 1e6, RMS ε 1e-6, MASK token id 248070
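
To make "precision-critical, tiny" concrete, here is a back-of-the-envelope sketch of how little space F32 norm vectors take. The number of norm vectors is not stated in this card, so `n_vectors` is a hypothetical parameter. The sketch also assumes every norm vector has the hidden width of 5120 and that the listed 1.06 GB means decimal gigabytes.

```python
HIDDEN = 5120            # hidden size from the spec list above
F32_BYTES = 4            # bytes per float32 element
Q4_FILE_BYTES = 1.06e9   # size of the q4_k_m file, assuming decimal GB


def norm_bytes(n_vectors: int, width: int = HIDDEN) -> int:
    """Bytes needed to store n_vectors norm weight vectors in F32."""
    return n_vectors * width * F32_BYTES


def fraction_of_file(n_vectors: int, file_bytes: float = Q4_FILE_BYTES) -> float:
    """Share of the file taken by those vectors (n_vectors is an assumption)."""
    return norm_bytes(n_vectors) / file_bytes


# One vector: 5120 * 4 = 20480 bytes.
print(norm_bytes(1))
# Even a deliberately generous 100 vectors stay well under 1% of the file.
print(f"{fraction_of_file(100):.4%}")
```

Under these assumptions, keeping the norms in F32 costs a negligible share of the file, which is consistent with the choice to leave them unquantized.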

Files

| File | Size | Purpose |
|---|---|---|
| dflash-draft-3.6-q4_k_m.gguf | 1.06 GB | Default/recommended draft model. Pass to dflash via --draft |
| dflash-draft-3.6-q8_0.gguf | 1.84 GB | Higher-precision draft for parity/debug checks |
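
After downloading either file, a quick sanity check is to read the fixed-size GGUF header and confirm the magic bytes. The sketch below uses only the standard library and assumes the little-endian v2/v3 header layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count). `read_gguf_header` and `GGUFHeader` are illustrative names, not part of the Lucebox tooling.

```python
import struct
from typing import BinaryIO, NamedTuple

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (KV count)


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Parse the fixed GGUF header from a binary stream (v2+ layout assumed)."""
    raw = stream.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError("file is too short to contain a GGUF header")
    if raw[:4] != GGUF_MAGIC:
        raise ValueError(f"bad magic {raw[:4]!r}; not a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:])
    if version < 2:
        # v1 stored the two counts as uint32, which this sketch does not handle.
        raise ValueError(f"unsupported GGUF version {version}")
    return GGUFHeader(version, tensor_count, kv_count)


# Example use (path taken from the download step in this card):
#   with open("models/draft/dflash-draft-3.6-q4_k_m.gguf", "rb") as f:
#       print(read_gguf_header(f))
```

A truncated or interrupted download usually fails this check early, before the engine attempts to load the model.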

Usage with the Lucebox dflash engine

# 1. Clone + checkout (PR 129 adds Qwen3.6 SWA support)
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/129/head:pr129 && git checkout pr129
git submodule update --init --recursive

# 2. Build (sm_86+ enables Block-Sparse Attention; sm_75 falls back to ggml flash_attn_ext)
cd dflash
cmake -B build -S . -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DDFLASH27B_ENABLE_BSA=ON \
  -DDFLASH27B_TESTS=ON
cmake --build build --target test_dflash -j

# 3. Get the target (Q4_K_M GGUF) and this draft
mkdir -p models/target models/draft
hf download unsloth/Qwen3.6-27B-GGUF --include "*Q4_K_M*.gguf" --local-dir models/target
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF --include "dflash-draft-3.6-q4_k_m.gguf" --local-dir models/draft

# 4. Run
export DFLASH_TARGET=models/target/Qwen3.6-27B-Q4_K_M.gguf
export DFLASH_DRAFT=models/draft/dflash-draft-3.6-q4_k_m.gguf
echo "Write a haiku about GPUs." | python3 scripts/run.py --max-ctx 2048 --n-gen 256

The binary autodetects .gguf vs .safetensors from the draft path.
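
A small pre-flight check can catch path mistakes before launching the run script. The sketch below is a hypothetical helper, not the binary's actual detection logic: it assumes the format is chosen purely by file extension, and it reuses the `DFLASH_TARGET` and `DFLASH_DRAFT` variable names from step 4.

```python
import os
from pathlib import Path
from typing import List, Mapping

# Assumed mapping from extension to draft format (illustrative only).
DRAFT_FORMATS = {".gguf": "gguf", ".safetensors": "safetensors"}


def classify_draft(path: str) -> str:
    """Return 'gguf' or 'safetensors' based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in DRAFT_FORMATS:
        raise ValueError(f"unrecognized draft extension {suffix!r} in {path}")
    return DRAFT_FORMATS[suffix]


def preflight(env: Mapping[str, str] = os.environ) -> List[str]:
    """Collect human-readable problems with the two model paths."""
    problems: List[str] = []
    for var in ("DFLASH_TARGET", "DFLASH_DRAFT"):
        value = env.get(var)
        if not value:
            problems.append(f"{var} is not set")
        elif not Path(value).is_file():
            problems.append(f"{var} points to a missing file: {value}")
    draft = env.get("DFLASH_DRAFT")
    if draft:
        try:
            classify_draft(draft)
        except ValueError as exc:
            problems.append(str(exc))
    return problems
```

Calling `preflight()` with no arguments inspects the current environment; an empty list means both variables are set, both files exist, and the draft has a recognized extension.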

Compatibility

  • Target: any Qwen3.6-27B-Q4_K_M.gguf (e.g. unsloth/Qwen3.6-27B-GGUF)
  • The DFlash arch (5 layers + dflash.fc.weight + dflash.hidden_norm.weight) is loaded by gguf_draft_loader.cpp. Quantizing this draft requires the matching Lucebox GGUF tooling; do not re-quantize with stock llama-quantize, which won't preserve the dflash-specific tensors.
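
One way to confirm that a draft file still carries the dflash-specific tensors is to compare its tensor names against the two named above. The sketch below is an illustration under assumptions: it takes the names exactly as written in this card, and the optional `tensor_names_from_gguf` helper relies on the `gguf` Python package (gguf-py) being installed and able to read the file. Neither function is part of the Lucebox tooling.

```python
from typing import Iterable, List

# Tensor names as listed in the compatibility note above.
REQUIRED_DFLASH_TENSORS = ("dflash.fc.weight", "dflash.hidden_norm.weight")


def missing_dflash_tensors(tensor_names: Iterable[str]) -> List[str]:
    """Return the required dflash tensors absent from tensor_names."""
    present = set(tensor_names)
    return [name for name in REQUIRED_DFLASH_TENSORS if name not in present]


def tensor_names_from_gguf(path: str) -> List[str]:
    """List tensor names in a GGUF file (assumes `pip install gguf`)."""
    from gguf import GGUFReader  # imported lazily so the check above has no deps

    return [tensor.name for tensor in GGUFReader(path).tensors]


# Example use:
#   names = tensor_names_from_gguf("models/draft/dflash-draft-3.6-q4_k_m.gguf")
#   print(missing_dflash_tensors(names) or "all dflash tensors present")
```

An empty result means both tensors are present; anything else suggests the file was produced or modified by tooling that dropped them.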

License & attribution

Apache 2.0, inheriting the upstream z-lab license. Original DFlash work and weights by z-lab; FP16 mirror by deepsweet; GGUF quantization + repackaging by Lucebox.
