granite-20b-code-instruct-8k-2048-Calibration-FP8
This is a premium FP8-quantized version of ibm-granite/granite-20b-code-instruct-8k, built with rigorous code-optimized, multi-dataset calibration for production-grade reliability.
Model Description
| Property | Value |
|---|---|
| Base Model | granite-20b-code-instruct-8k |
| Architecture | Dense (20B parameters) |
| Context Length | 8K tokens |
| Quantization | FP8 (E4M3 format) via llm-compressor |
| Target Hardware | NVIDIA Ada Lovelace & Hopper GPUs |
| Quantization Time | 124.8 minutes (~2.1 hours) |
| Calibration Samples | 2,048 (premium code-optimized) |
Usage
With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/granite-20b-code-instruct-8k-2048-Calibration-FP8",
    torch_dtype="auto",  # picks up the FP8 (compressed-tensors) config from the checkpoint
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/granite-20b-code-instruct-8k-2048-Calibration-FP8"
)

# Generate
prompt = "Write a Python function to calculate fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
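Native FP8 kernels require an Ada Lovelace or Hopper GPU (see Quantization Details below). A quick sketch for checking this locally, relying on the fact that Ada Lovelace reports compute capability 8.9 and Hopper 9.0:

```python
# Check for native FP8 support: Ada Lovelace reports compute capability 8.9,
# Hopper reports 9.0; older GPUs (e.g. Ampere) fall back to emulation.
import torch

major, minor = torch.cuda.get_device_capability()
print("Native FP8 support:", (major, minor) >= (8, 9))
```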
With vLLM (Recommended for production)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="TevunahAi/granite-20b-code-instruct-8k-2048-Calibration-FP8")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = ["Write a Python function to calculate fibonacci numbers:"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
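For serving, the same checkpoint can be exposed through vLLM's OpenAI-compatible server and queried with the standard openai client. A minimal sketch, assuming the server's default port 8000:

```python
# Start the server first (default port 8000):
#   vllm serve TevunahAi/granite-20b-code-instruct-8k-2048-Calibration-FP8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="TevunahAi/granite-20b-code-instruct-8k-2048-Calibration-FP8",
    prompt="Write a Python function to calculate fibonacci numbers:",
    max_tokens=256,
    temperature=0.7,
)
print(resp.choices[0].text)
```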
Premium Code-Optimized Calibration
This model was quantized using TevunahAi's premium code-focused calibration process:
Calibration Details
- Total Samples: 2,048 (4-8x industry standard)
- Datasets Used: 4 code-focused sources
- Coverage: Comprehensive across coding tasks
| Dataset | Samples | Purpose |
|---|---|---|
| HuggingFaceH4/CodeAlpaca_20K | 512 | Code instruction pairs |
| garage-bAInd/Open-Platypus | 512 | STEM/reasoning (includes code) |
| teknium/OpenHermes-2.5 | 512 | Diverse instructions |
| theblackcat102/evol-codealpaca-v1 | 512 | Evolved code examples |
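For illustration, here is a hypothetical sketch of how such a 4 x 512 mix could be assembled with the `datasets` library. The exact sampling and preprocessing are TevunahAi-internal, and the catch-all field flattening below is an assumption; a real pipeline would format each source's own schema (instructions, conversations, completions) into a shared text field:

```python
# Hypothetical assembly of the 2,048-sample calibration mix; not the exact
# TevunahAi pipeline.
from datasets import concatenate_datasets, load_dataset

SOURCES = [
    "HuggingFaceH4/CodeAlpaca_20K",
    "garage-bAInd/Open-Platypus",
    "teknium/OpenHermes-2.5",
    "theblackcat102/evol-codealpaca-v1",
]

def to_text(example):
    # Catch-all: join whatever string fields a source provides into one
    # "text" column. A real pipeline would use per-dataset formatting.
    parts = [v for v in example.values() if isinstance(v, str)]
    return {"text": "\n".join(parts)}

parts = []
for name in SOURCES:
    ds = load_dataset(name, split="train").shuffle(seed=42).select(range(512))
    parts.append(ds.map(to_text, remove_columns=ds.column_names))

calib_ds = concatenate_datasets(parts)  # 4 x 512 = 2,048 samples
```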
Why Code-Optimized Calibration?
Most FP8 quantizations use generic chat data for calibration. TevunahAi uses 2,048 samples from 4 code-focused datasets, ensuring:
- ✅ Superior code generation quality
- ✅ Better handling of programming syntax
- ✅ Optimized for multiple languages
- ✅ Accurate completion of complex code
- ✅ Production-grade reliability for coding tasks
For code models, generic calibration isn't enough. TevunahAi uses code-specific data.
Quantization Details
- Target Layers: All Linear layers except lm_head (see the recipe sketch after this list)
- Precision: FP8 (E4M3 format)
- Hardware Requirements: NVIDIA Ada Lovelace or Hopper (native FP8) or Ampere with emulation
- VRAM Usage: ~20GB (fits on RTX 4090, A100, or 2x RTX 4080)
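A minimal sketch of what this configuration looks like as an llm-compressor recipe. This is not the exact TevunahAi recipe; the oneshot import path varies slightly across llm-compressor versions, and `calib_ds` stands in for the calibration mix described above:

```python
# A minimal sketch, not the exact TevunahAi recipe. Assumes `calib_ds` is
# the 2,048-sample calibration mix assembled earlier.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",     # quantize all Linear layers...
    ignore=["lm_head"],   # ...except the output head
    scheme="FP8",         # static FP8 (E4M3) weights and activations
)

oneshot(
    model="ibm-granite/granite-20b-code-instruct-8k",
    dataset=calib_ds,
    recipe=recipe,
    max_seq_length=8192,
    num_calibration_samples=2048,
)
```

The static FP8 scheme is what makes calibration data matter: activation scales are fixed from the calibration samples rather than computed at runtime, so the quality of the mix directly affects quantization accuracy.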
Quantization Infrastructure
Quantized on professional hardware optimized for high-quality model compression:
- CPUs: Dual Intel Xeon Max 9480 (224 threads, 128GB HBM2e @ 2000 GB/s)
- Memory: 256GB DDR5-4800 (16 DIMMs, 8-channel per socket, ~614 GB/s)
- Total Memory Bandwidth: ~2,614 GB/s aggregate
- GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM) with native FP8 support
- Software: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13 | llm-compressor
Performance Notes
- Quantization time: 124.8 minutes with premium 2048-sample calibration
- Memory reduction: 40GB FP16 → ~20GB FP8 (50% reduction); see the back-of-envelope math below
- Inference speed: 2-3x faster on Ada Lovelace GPUs vs FP16
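The memory figure follows directly from parameter count times bytes per parameter (weights only; KV cache and activations add overhead on top):

```python
# Back-of-envelope weight memory for a 20B-parameter model.
# Excludes KV cache and activation overhead, which add several GB in practice.
params = 20e9
fp16_gb = params * 2 / 1e9  # 2 bytes/param in FP16 -> 40.0 GB
fp8_gb = params * 1 / 1e9   # 1 byte/param in FP8   -> 20.0 GB
print(f"FP16: {fp16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB")
```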
About IBM Granite Code
Granite-20B-Code is IBM's mid-size enterprise-grade code model, featuring:
- Strong code generation across 100+ programming languages
- Optimized for enterprise coding tasks
- 8K context window (2x the 4K window of the 8B model)
- Excellent balance of capability and efficiency
- Apache 2.0 license
IBM Granite Code Family
TevunahAi provides premium FP8 quantizations for the IBM Granite Code family:
| Model | Parameters | Context | Quantization Time | VRAM Usage |
|---|---|---|---|---|
| granite-8b-code-instruct-4k-2048-Calibration-FP8 | 8B | 4K | 55.8 min | ~8GB |
| granite-20b-code-instruct-8k-2048-Calibration-FP8 (this) | 20B | 8K | 124.8 min | ~20GB |
| granite-34b-code-instruct-8k-2048-Calibration-FP8 | 34B | 8K | Coming soon | ~34GB |
All models calibrated with identical premium 2048-sample code-focused datasets.
Comparison: Standard vs Premium Calibration
TevunahAi offers two quantization tiers for this model:
| Version | Calibration | Samples | Datasets | Use Case |
|---|---|---|---|---|
| Standard FP8 | Basic | 256 | 1 | Quick deployment |
| Premium FP8 (this) | Code-optimized | 2,048 | 4 code-focused | Production-grade |
When to Choose Premium:
- ✅ Production deployments
- ✅ Quality-critical applications
- ✅ API services at scale
- ✅ Benchmarking and evaluation
When Standard is Fine:
- ✅ Quick testing
- ✅ Development/prototyping
- ✅ Resource-constrained environments
- ✅ Non-critical applications
License
Apache 2.0 (same as original model)
Credits
- Original model by IBM Granite
- Quantized by TevunahAi
- Quantization powered by llm-compressor
Why TevunahAi 2048-Calibration FP8?
Task-Optimized Calibration
TevunahAi doesn't use one-size-fits-all calibration:
| Model Type | Calibration Focus |
|---|---|
| Code Models | Code-specific datasets (CodeAlpaca, evol-codealpaca) |
| General Models | Diverse instruction datasets (UltraChat, SlimOrca) |
The right calibration for the right model.
The Difference is in the Details
| Aspect | Standard FP8 | TevunahAi 2048-Calibration FP8 |
|---|---|---|
| Calibration Samples | 128-512 | 2,048 |
| Datasets | Single generic | 4 code-focused |
| Edge Case Handling | Adequate | Superior |
| Code Quality | Good | Excellent |
| Production Ready | Maybe | Absolutely |
Professional Infrastructure
- 2.6 TB/s aggregate memory bandwidth
- 2,048 samples across 4 code-focused datasets
- Quality-first approach over speed
- Enterprise-ready results