granite-34b-code-instruct-8k-Ultra-Hybrid
TevunahAi Professional-Grade Ultra Hybrid Quantization
Enterprise-quality hybrid quantization of IBM Granite 34B Code with 2048-sample calibration (8x industry standard)
⚠️ Requires vLLM for inference - Standard transformers library does not yet support mixed-precision quantization
Model Info
IBM Granite 34B Code is a decoder-only code model trained on 116 programming languages, optimized for code generation, bug fixing, code explanation, and documentation.
Quantization Strategy (GPTBigCode Architecture)
| Layer Type | Precision | Rationale |
|---|---|---|
| Embeddings (wte, wpe) | FP16 | Preserved for quality |
| First 2 Attention | FP8 | Foundation layers need precision |
| Middle Attention (Layers 2-85) | W8A8 (INT8) | Balanced performance |
| Last 2 Attention | FP8 | Output precision critical |
| ALL MLP Layers | W4A16 (INT4) | ~67% of parameters - massive savings |
| lm_head | FP16 | Output head preserved |
| LayerNorms | FP16 | Normalization layers preserved |
Why This Works
- MLP layers constitute ~67% of Granite-34B's parameters
- INT4 on MLPs provides massive compression with minimal quality loss
- FP8/INT8 on attention maintains code reasoning capability
- Result: 21.8GB model size vs 68GB FP16, with 98-99% quality retention
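To make the recipe concrete, here is a small illustrative sketch (not the actual quantization script) of how GPTBigCode module names map onto the precision schemes in the table above. The `transformer.h.{i}.attn` / `.mlp` naming follows the GPTBigCode implementation in `transformers`; the selection logic simply encodes the table.

```python
# Illustrative sketch: assign a precision scheme to each module of the
# 88-layer GPTBigCode stack, following the strategy table above.
NUM_LAYERS = 88

def scheme_for(module_name: str) -> str:
    """Return the quantization scheme for a GPTBigCode module name."""
    if module_name in ("transformer.wte", "transformer.wpe", "lm_head"):
        return "FP16"                      # embeddings and output head preserved
    if "ln" in module_name:
        return "FP16"                      # LayerNorms preserved
    layer = int(module_name.split(".")[2]) # transformer.h.<layer>.<submodule>...
    if ".mlp" in module_name:
        return "W4A16"                     # all MLPs -> INT4 weights
    if ".attn" in module_name:
        if layer < 2 or layer >= NUM_LAYERS - 2:
            return "FP8"                   # first/last two attention blocks
        return "W8A8"                      # middle attention blocks -> INT8
    return "FP16"

# Example: layer 0 keeps FP8 attention, layer 40 drops to INT8, MLPs go to INT4.
for name in ("transformer.h.0.attn.c_attn", "transformer.h.40.attn.c_attn",
             "transformer.h.40.mlp.c_fc", "lm_head"):
    print(f"{name:32s} -> {scheme_for(name)}")
```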
Quantization Details
| Property | Value |
|---|---|
| Base Model | ibm-granite/granite-34b-code-instruct-8k |
| Architecture | GPTBigCode (StarCoder variant) |
| Method | Ultra Hybrid (W4A16 + W8A8 + FP8) |
| Total Layers | 88 |
| Calibration Samples | 2,048 (Professional Grade) |
| Calibration Datasets | Open-Platypus, UltraChat-200k, OpenHermes-2.5, SlimOrca |
| Hardware | Dual Xeon Max 9480 + RTX 5000 Ada |
| Optimizations | AMX (Sapphire Rapids), TF32 (Ada Lovelace) |
| Model Size | 21.8GB |
| VRAM Usage | 20.4GB (with vLLM) |
Compression Comparison
| Version | Size | Compression vs FP16 | Compression vs FP8 | Quality |
|---|---|---|---|---|
| FP16 | 68GB | 1.0x | - | 100% |
| FP8 | 34.7GB | 1.96x | 1.0x | 98-99% |
| Ultra Hybrid | 21.8GB | 3.12x | 1.59x | 98-99% |
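As a quick sanity check of the figures above, the ratios follow directly from the sizes:

```python
# Reproduce the compression ratios in the table from the model sizes.
fp16_gb, fp8_gb, hybrid_gb = 68.0, 34.7, 21.8
print(f"FP8 vs FP16:    {fp16_gb / fp8_gb:.2f}x")     # ~1.96x
print(f"Hybrid vs FP16: {fp16_gb / hybrid_gb:.2f}x")  # ~3.12x
print(f"Hybrid vs FP8:  {fp8_gb / hybrid_gb:.2f}x")   # ~1.59x
```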
Verified Performance
Tested on NVIDIA RTX 5000 Ada (32GB):
- ✅ VRAM Usage: 20.4 GiB
- ✅ Generation Speed: 20+ tokens/sec
- ✅ Loading Time: ~5 seconds (cached)
- ✅ Quality: Production-ready code generation
- ✅ Optimized Kernels: MarlinLinear (INT4) + CutlassScaledMM (INT8)
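A minimal sketch of how the tokens/sec figure can be reproduced locally with vLLM; the prompt and sampling settings here are illustrative, and results will vary with hardware and output length.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM("TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid")
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(["Write a Python function that merges two sorted lists:"], params)
elapsed = time.perf_counter() - start

# Count generated tokens and derive throughput
generated = len(outputs[0].outputs[0].token_ids)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/sec")
```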
Usage
Installation:
```bash
pip install vllm
```
Basic Inference:
```python
from vllm import LLM, SamplingParams

# Load model
llm = LLM("TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid")

# Configure generation
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=200,
)

# Generate
prompts = ["Write a Python function to calculate fibonacci numbers:"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
OpenAI-Compatible API Server:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid \
    --host 0.0.0.0 \
    --port 8000
```
Then use with OpenAI client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # dummy key
)

response = client.completions.create(
    model="TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid",
    prompt="Write a function to reverse a string:",
    max_tokens=100,
)
print(response.choices[0].text)
```
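Since this is an instruct model, the chat completions endpoint of the same server can also be used; the example below assumes the standard OpenAI-compatible chat API exposed by vLLM and the model's built-in chat template.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

response = client.chat.completions.create(
    model="TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid",
    messages=[
        {"role": "user", "content": "Explain what this regex does: ^\\d{3}-\\d{4}$"}
    ],
    max_tokens=150,
)
print(response.choices[0].message.content)
```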
Why vLLM is Required
This model uses mixed-precision quantization (INT4 + INT8 + FP8 in a single checkpoint), which most inference stacks do not yet support. Currently:
- ❌ transformers: Does not support mixed-precision compressed-tensors
- ✅ vLLM: Full support with optimized inference kernels
- ✅ llmcompressor: Can load but slower than vLLM
As the ecosystem matures, transformers support will likely be added. For now, vLLM provides the best performance.
TevunahAi Professional Standard
Unlike consumer-grade quantizations that typically use 256 calibration samples, TevunahAi uses 2,048 diverse samples (8x the industry baseline) to ensure:
- ✅ More accurate quantization ranges
- ✅ Better representation of diverse use cases and programming patterns
- ✅ Reduced outlier effects in code generation
- ✅ Professional-grade quality suitable for production deployment
- ✅ Verified quality through extensive testing
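For illustration, a minimal sketch of assembling a 2,048-sample mixed calibration set from the datasets listed above. The dataset repository IDs, split names, and field handling are assumptions based on the public Hugging Face versions of those datasets; the actual preprocessing used for this model is not published.

```python
from datasets import load_dataset

# Assumed repository IDs and splits for the four calibration sources.
CALIBRATION_SOURCES = {
    "garage-bAInd/Open-Platypus": "train",
    "HuggingFaceH4/ultrachat_200k": "train_sft",
    "teknium/OpenHermes-2.5": "train",
    "Open-Orca/SlimOrca": "train",
}
TOTAL_SAMPLES = 2048
PER_DATASET = TOTAL_SAMPLES // len(CALIBRATION_SOURCES)  # 512 from each source

def to_text(example: dict) -> str:
    # Flatten string fields and chat-style message lists into one document;
    # a real pipeline would apply the model's chat template instead.
    parts = []
    for value in example.values():
        if isinstance(value, str):
            parts.append(value)
        elif isinstance(value, list):  # e.g. ShareGPT-style "conversations"
            parts.extend(m.get("value") or m.get("content") or ""
                         for m in value if isinstance(m, dict))
    return "\n".join(p for p in parts if p)

calibration_texts = []
for repo_id, split in CALIBRATION_SOURCES.items():
    ds = load_dataset(repo_id, split=split).shuffle(seed=42).select(range(PER_DATASET))
    calibration_texts.extend(to_text(row) for row in ds)

print(f"{len(calibration_texts)} calibration documents prepared")
```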
Hardware Requirements
| Use Case | VRAM | Notes |
|---|---|---|
| Minimum | 24GB | RTX 4090, RTX A5000 |
| Recommended | 32GB | RTX 5000 Ada, A100 40GB |
| Optimal | 48GB+ | RTX 6000 Ada, A100 80GB |
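For cards near the 24GB minimum, vLLM's memory-related options can be tuned. The values below are illustrative starting points rather than tested settings:

```python
from vllm import LLM

# Illustrative settings for a 24GB card; adjust to your workload.
llm = LLM(
    "TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid",
    gpu_memory_utilization=0.95,  # fraction of VRAM vLLM may reserve
    max_model_len=4096,           # shorter context reduces KV-cache memory
)
```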
Model Characteristics
Strengths:
- Code generation across 116 languages
- Bug detection and fixing
- Code explanation and documentation
- API usage and integration examples
- 37% smaller than FP8-only quantization
- Fast inference with optimized kernels
Best For:
- Production code completion services
- AI-powered IDEs
- Code review and analysis tools
- Educational coding assistants
- API documentation generation
Citation
```bibtex
@misc{granite-34b-ultra-hybrid,
  author       = {TevunahAi},
  title        = {Granite 34B Code Ultra Hybrid Quantization},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid}},
  note         = {Professional-grade mixed-precision quantization with 2048-sample calibration}
}
```
License
This model inherits the Apache 2.0 License from IBM Granite.