Gemma-4-e2b-CodeX-Distill-v1-GGUF

A distilled code-focused variant of Gemma-4 e2b, optimized for efficient local inference using GGUF format. This model is designed for coding assistance, reasoning, and structured generation tasks, with optional “thinking” mode enabled via chat templates.


Example usage:

  • For text-only LLMs: llama-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF --jinja
  • For multimodal models: llama-mtmd-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF --jinja

📦 Available Model Files

  • gemma-4-e2b-it.Q8_0.gguf — Quantized model (Q8_0 for high quality)
  • gemma-4-e2b-it.BF16-mmproj.gguf — Multimodal projection (required for full functionality)
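
To fetch both files listed above programmatically, here is a minimal sketch using huggingface_hub (the repo and file names are taken from this card; install with pip install huggingface_hub first):

from huggingface_hub import hf_hub_download

repo = "nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF"
model_path = hf_hub_download(repo, "gemma-4-e2b-it.Q8_0.gguf")          # quantized weights
mmproj_path = hf_hub_download(repo, "gemma-4-e2b-it.BF16-mmproj.gguf")  # multimodal projection
print(model_path, mmproj_path)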

🚀 Features

  • Strong code generation & reasoning (CodeX-style distillation)
  • Long context support (tested up to 131k tokens)
  • Optimized for llama.cpp
  • Supports structured chat templates (Jinja-based)
  • Optional “thinking mode” for better reasoning traces

🖥️ Running with llama.cpp

Make sure you’re using a recent build of llama.cpp with:

  • Flash Attention enabled
  • Jinja/chat template support compiled

Start Server

llama-server \
  -m gemma-4-e2b-it.Q8_0.gguf \
  --port 53281 \
  -c 131072 \
  --parallel 1 \
  --flash-attn on \
  --no-context-shift \
  -ngl -1 \
  --jinja \
  --chat-template-kwargs "{\"enable_thinking\": true}" \
  --mmproj gemma-4-e2b-it.BF16-mmproj.gguf

Key Flags Explained

  • -c 131072 → Enables long context (131k tokens)
  • --flash-attn on → Faster attention (requires compatible GPU)
  • -ngl -1 → Offload all layers to GPU
  • --jinja → Enables chat template rendering
  • --chat-template-kwargs → Activates thinking mode
  • --mmproj → Required for multimodal projection
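
Once the server is up, you can optionally sanity-check these settings through its built-in endpoints. The sketch below assumes a recent llama-server build; the exact JSON fields returned by /props may differ between versions.

import requests

base = "http://localhost:53281"
# /health returns {"status": "ok"} once the model has finished loading
print(requests.get(f"{base}/health", timeout=10).json())
# /props reports the effective settings; n_ctx should reflect -c 131072 (field name is an assumption)
props = requests.get(f"{base}/props", timeout=10).json()
print(props.get("default_generation_settings", {}).get("n_ctx"))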

Test Request

curl http://localhost:53281/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function to reverse a linked list"}
    ]
  }'
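
Because the server exposes an OpenAI-compatible API, the same request can be made from Python with the openai client. A minimal sketch; the model name is only a placeholder, since llama-server serves whichever GGUF it loaded:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:53281/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="gemma-4-e2b-codex",  # placeholder name; the loaded GGUF is used regardless
    messages=[
        {"role": "user", "content": "Write a Python function to reverse a linked list"}
    ],
)
print(response.choices[0].message.content)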

🧠 Notes on Thinking Mode

When enable_thinking=true, the model may:

  • Produce intermediate reasoning steps
  • Improve structured problem solving
  • Slightly increase latency

Disable it if you need faster responses.
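
If your llama.cpp build accepts a per-request chat_template_kwargs field (an assumption about recent builds; if yours does not, restart the server without --chat-template-kwargs), thinking can be switched off for individual requests:

import requests

resp = requests.post(
    "http://localhost:53281/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Reverse a string in Python, one-liner only."}],
        # assumed request field; overrides the server-level enable_thinking setting
        "chat_template_kwargs": {"enable_thinking": False},
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])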


🦙 Running with Ollama

Important: ⚠️ Ollama currently does not support loading a separate mmproj file for vision models, so the multimodal projection cannot be used this way and the model runs text-only under Ollama.

Create a Modelfile:

FROM ./gemma-4-e2b-it.Q8_0.gguf

PARAMETER num_ctx 131072
PARAMETER num_gpu -1
PARAMETER stop "<end_of_turn>"

TEMPLATE """{{ if .System }}<start_of_turn>system
{{ .System }}<end_of_turn>
{{ end }}{{ if .Prompt }}<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ end }}"""

# Optional: enable reasoning-style outputs
SYSTEM "You are a highly capable coding assistant with strong reasoning ability."

Build & Run

ollama create gemma-4-codex -f Modelfile
ollama run gemma-4-codex
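
Once built, the model can also be called from Python through Ollama's local REST API (default port 11434). A short sketch using requests:

import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma-4-codex",
        "messages": [
            {"role": "user", "content": "Write a Python function to reverse a linked list"}
        ],
        "stream": False,  # return one JSON object instead of streamed chunks
    },
    timeout=300,
)
print(resp.json()["message"]["content"])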

⚙️ Recommended Settings

Use Case            Context     GPU Layers   Notes
Coding assistant    32k–64k     Full (-1)    Best balance
Long reasoning      131k        Full         Needs high VRAM
Low VRAM setup      8k–16k      Partial      Disable flash-attn

⚠️ Limitations

  • Requires significant VRAM for full 131k context
  • Thinking mode increases latency
  • Multimodal projection file must match model variant

📜 License

Follow the original Gemma license and any additional terms from this distillation.


🙌 Credits

  • Base model: Google Gemma family
  • Distillation: Code-focused adaptation
  • Runtime: llama.cpp ecosystem