Qwen3.6-27B DFlash Draft (GGUF)
GGUF quantizations of the z-lab/Qwen3.6-27B-DFlash draft model, produced for the Lucebox dflash engine (speculative decoding for Qwen3.6-27B-Q4_K_M).
- Source: deepsweet/Qwen3.6-27B-DFlash-FP16 (FP16 safetensors mirror of z-lab's BF16)
- Default: `dflash-draft-3.6-q4_k_m.gguf` (1.06 GB), the faster, lower-memory draft used by the current Lucebox quickstarts
- Q8_0: `dflash-draft-3.6-q8_0.gguf` (1.84 GB), kept for conservative parity checks
- Arch: `qwen35-dflash-draft`, 5 layers, hidden size 5120, n_target_layers 5, vocab 248320
- Tensors: projection weights quantized; norms kept in F32 (precision-critical, tiny)
- Block size: 16, RoPE θ 1e6, RMS ε 1e-6, MASK token id 248070
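To make the block size and MASK token id concrete, here is a minimal sketch of one speculative step. It assumes the draft fills a 16-slot block made of the last accepted token followed by 15 MASK (id 248070) placeholders, and that the target verifies greedily by keeping the longest prefix of draft tokens that match its own argmax predictions, plus one target token. The helper names `build_draft_block` and `accept_prefix` are hypothetical; the Lucebox engine works on GPU tensors and target hidden states, not Python lists.

```python
from typing import List, Tuple

BLOCK_SIZE = 16          # "Block size: 16" from the spec list
MASK_TOKEN_ID = 248070   # "MASK token id 248070" from the spec list


def build_draft_block(anchor_token: int, block_size: int = BLOCK_SIZE,
                      mask_id: int = MASK_TOKEN_ID) -> List[int]:
    """Assumed layout: last accepted token, then (block_size - 1) MASK slots
    that the draft fills in one parallel forward pass."""
    return [anchor_token] + [mask_id] * (block_size - 1)


def accept_prefix(draft_tokens: List[int],
                  target_argmax: List[int]) -> Tuple[List[int], int]:
    """Greedy verification of drafted tokens against the target.

    target_argmax[i] is the target's prediction at position i given the
    context plus draft_tokens[:i]; it must have len(draft_tokens) + 1 entries
    so the target always contributes one token (a correction or a bonus).
    Returns (tokens to append, number of draft tokens accepted).
    """
    accepted = 0
    for drafted, predicted in zip(draft_tokens, target_argmax):
        if drafted != predicted:
            break
        accepted += 1
    # The target's token at the first mismatch (or after a full match) is kept.
    return draft_tokens[:accepted] + [target_argmax[accepted]], accepted


if __name__ == "__main__":
    block = build_draft_block(anchor_token=42)
    print(len(block), block.count(MASK_TOKEN_ID))   # prints: 16 15
    out, n = accept_prefix([7, 8, 9, 10], [7, 8, 5, 1, 2])
    print(out, n)                                   # prints: [7, 8, 5] 2
```

Each step therefore appends between 1 and 16 tokens, so the speedup depends on how many of the 15 drafted positions the target agrees with.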
Files
| File | Size | Purpose |
|---|---|---|
| `dflash-draft-3.6-q4_k_m.gguf` | 1.06 GB | Default/recommended draft model. Pass to dflash via `--draft` |
| `dflash-draft-3.6-q8_0.gguf` | 1.84 GB | Higher-precision draft for parity/debug checks |
Usage with the Lucebox dflash engine
```bash
# 1. Clone + checkout (PR 129 adds Qwen3.6 SWA support)
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/129/head:pr129 && git checkout pr129
git submodule update --init --recursive

# 2. Build (sm_86+ enables Block-Sparse Attention; sm_75 falls back to ggml flash_attn_ext).
#    Set CMAKE_CUDA_ARCHITECTURES to match your GPU.
cd dflash
cmake -B build -S . -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DDFLASH27B_ENABLE_BSA=ON \
  -DDFLASH27B_TESTS=ON
cmake --build build --target test_dflash -j

# 3. Get the target (Q4_K_M GGUF) and this draft
mkdir -p models/target models/draft
hf download unsloth/Qwen3.6-27B-GGUF --include "*Q4_K_M*.gguf" --local-dir models/target
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF --include "dflash-draft-3.6-q4_k_m.gguf" --local-dir models/draft

# 4. Run
export DFLASH_TARGET=models/target/Qwen3.6-27B-Q4_K_M.gguf
export DFLASH_DRAFT=models/draft/dflash-draft-3.6-q4_k_m.gguf
echo "Write a haiku about GPUs." | python3 scripts/run.py --max-ctx 2048 --n-gen 256
```
The binary autodetects .gguf vs .safetensors from the draft path.
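A minimal sketch of how a loader might tell the two draft formats apart, assuming it trusts the file extension first and otherwise sniffs the leading bytes: GGUF files start with the ASCII magic `GGUF` followed by a little-endian uint32 version, while safetensors files start with a little-endian uint64 header length followed by a JSON object. `detect_draft_format` is a hypothetical helper for illustration; the engine's actual detection logic may differ.

```python
import struct
from pathlib import Path


def detect_draft_format(path: str) -> str:
    """Return "gguf" or "safetensors" for a draft checkpoint path (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix in (".gguf", ".safetensors"):
        return suffix[1:]  # trust the extension first
    # Fallback: sniff the leading bytes of the file.
    with open(path, "rb") as f:
        head = f.read(8)
        if head[:4] == b"GGUF":  # GGUF magic; a little-endian uint32 version follows
            return "gguf"
        if len(head) == 8:
            # safetensors: little-endian uint64 header length, then a JSON object
            (header_len,) = struct.unpack("<Q", head)
            # The upper bound is a sanity cap so random bytes are not misread.
            if 0 < header_len < 100_000_000 and f.read(1) == b"{":
                return "safetensors"
    raise ValueError(f"unrecognized draft format: {path}")


if __name__ == "__main__":
    print(detect_draft_format("models/draft/dflash-draft-3.6-q4_k_m.gguf"))  # prints: gguf
```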
Compatibility
- Target: any `Qwen3.6-27B-Q4_K_M.gguf` (e.g. `unsloth/Qwen3.6-27B-GGUF`)
- The DFlash arch (5 layers + `dflash.fc.weight` + `dflash.hidden_norm.weight`) is loaded by `gguf_draft_loader.cpp`. Quantizing this draft requires the matching Lucebox GGUF tooling; do not re-quantize with stock `llama-quantize`, because it won't preserve the dflash-specific tensors.
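As an illustration only, here is a hypothetical per-tensor policy that a DFlash-aware quantizer would need to respect: keep norms and any other 1-D tensor in F32, quantize 2-D projection weights such as `dflash.fc.weight`, and refuse to continue if either dflash-specific tensor is missing. The rule itself, the default `Q8_0` type, the `blk.*` tensor names, and the example shapes are all assumptions; this is not the Lucebox tooling's actual logic.

```python
from typing import Dict, Tuple

# The two dflash-specific tensors named in the compatibility notes.
REQUIRED_DFLASH_TENSORS = ("dflash.fc.weight", "dflash.hidden_norm.weight")


def plan_tensor_types(shapes: Dict[str, Tuple[int, ...]],
                      weight_type: str = "Q8_0") -> Dict[str, str]:
    """Choose an output type per tensor under a hypothetical DFlash-aware policy.

    Norms and other 1-D tensors stay F32 (precision-critical, tiny);
    2-D projection weights get `weight_type`. Nothing is dropped or renamed.
    """
    missing = [name for name in REQUIRED_DFLASH_TENSORS if name not in shapes]
    if missing:
        raise ValueError(f"dflash-specific tensors missing: {missing}")
    return {
        name: "F32" if ("norm" in name or len(shape) == 1) else weight_type
        for name, shape in shapes.items()
    }


if __name__ == "__main__":
    # Illustrative shapes; fc is assumed to map 5 tapped target layers x 5120 -> 5120.
    example = {
        "dflash.fc.weight": (25600, 5120),
        "dflash.hidden_norm.weight": (5120,),
        "blk.0.attn_norm.weight": (5120,),    # llama.cpp-style name, assumed
        "blk.0.attn_q.weight": (5120, 5120),  # llama.cpp-style name, assumed
    }
    for name, ttype in plan_tensor_types(example).items():
        print(f"{name:28s} {ttype}")
```

The point of the missing-tensor check is the failure mode the warning describes: a generic quantizer that does not know the `dflash.*` tensors could emit a file the draft loader cannot use.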
License & attribution
Apache 2.0, inheriting the upstream z-lab license. Original DFlash work and weights by z-lab; FP16 mirror by deepsweet; GGUF quantization + repackaging by Lucebox.
Model tree for Lucebox/Qwen3.6-27B-DFlash-GGUF: base model z-lab/Qwen3.6-27B-DFlash