granite-20b-code-instruct-8k-2048-Calibration-FP8

This is a premium FP8-quantized version of ibm-granite/granite-20b-code-instruct-8k, calibrated on 2,048 samples from four datasets chosen for code workloads and intended for production use.

Model Description

| Property | Value |
|---|---|
| Base Model | granite-20b-code-instruct-8k |
| Architecture | Dense (20B parameters) |
| Context Length | 8K tokens |
| Quantization | FP8 (E4M3 format) via llm-compressor |
| Target Hardware | NVIDIA Ada Lovelace & Hopper GPUs |
| Quantization Time | 124.8 minutes (~2.1 hours) |
| Calibration Samples | 2,048 (premium code-optimized) |

Usage

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Reading the FP8 checkpoint requires the compressed-tensors package.
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/granite-20b-code-instruct-8k-2048-Calibration-FP8",
    torch_dtype="auto",  # keep the dtypes stored in the checkpoint; do not cast to float8
    device_map="auto",
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained("TevunahAi/granite-20b-code-instruct-8k-2048-Calibration-FP8")

# Generate
prompt = "Write a Python function to calculate fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With vLLM (Recommended for production)

from vllm import LLM, SamplingParams

llm = LLM(model="TevunahAi/granite-20b-code-instruct-8k-2048-Calibration-FP8")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

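prompts = ["Write a Python function to calculate fibonacci numbers:"]
outputs = llm.generate(prompts, sampling_params)

# Print the generated completion for each prompt
for output in outputs:
    print(output.outputs[0].text)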
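
Whether either loading path runs FP8 natively depends on the GPU generation listed under Hardware Requirements. The following is a minimal sketch of that check, not part of the original card. The function names are hypothetical, and the thresholds assume CUDA compute capability 8.9 for Ada Lovelace, 9.0 for Hopper, and 8.x for Ampere; treating anything older as unsupported is also an assumption.

```python
def fp8_support(major: int, minor: int) -> str:
    """Classify a CUDA compute capability by FP8 support level (assumed thresholds)."""
    if (major, minor) >= (8, 9):  # Ada Lovelace (8.9), Hopper (9.0), and newer
        return "native"
    if major == 8:                # Ampere (8.0 / 8.6): no FP8 tensor cores
        return "emulated"
    return "unsupported"          # assumption: older GPUs are not covered by the card


def describe_local_gpu() -> str:
    """Query the first CUDA device; needs PyTorch and a visible GPU."""
    import torch  # imported lazily so fp8_support() has no dependencies

    if not torch.cuda.is_available():
        return "no CUDA device"
    major, minor = torch.cuda.get_device_capability(0)
    return f"{torch.cuda.get_device_name(0)}: {fp8_support(major, minor)}"


print(fp8_support(8, 9))  # e.g. RTX 4090 -> native
print(fp8_support(9, 0))  # e.g. H100     -> native
print(fp8_support(8, 0))  # e.g. A100     -> emulated
```

Calling `describe_local_gpu()` on the target machine before loading the model gives a quick indication of which case applies.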

Premium Code-Optimized Calibration

This model was quantized using TevunahAi's premium code-focused calibration process:

Calibration Details

  • Total Samples: 2,048 (4-16x the 128-512 samples typical of standard FP8 calibration)
  • Datasets Used: 4 code-focused sources
  • Coverage: Comprehensive across coding tasks

| Dataset | Samples | Purpose |
|---|---|---|
| HuggingFaceH4/CodeAlpaca_20K | 512 | Code instruction pairs |
| garage-bAInd/Open-Platypus | 512 | STEM/reasoning (includes code) |
| teknium/OpenHermes-2.5 | 512 | Diverse instructions |
| theblackcat102/evol-codealpaca-v1 | 512 | Evolved code examples |
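
The per-dataset counts above can be written down as a small sampling plan. This is an illustrative sketch only: the card does not say how rows were selected, so the seeded random selection, the `pick_indices` helper, and the 20,000-row dataset size used in the example are all assumptions.

```python
import random

# Per-dataset sample counts, copied from the calibration table.
CALIBRATION_MIX = {
    "HuggingFaceH4/CodeAlpaca_20K": 512,
    "garage-bAInd/Open-Platypus": 512,
    "teknium/OpenHermes-2.5": 512,
    "theblackcat102/evol-codealpaca-v1": 512,
}


def pick_indices(dataset_size: int, n_samples: int, seed: int = 0) -> list[int]:
    """Choose n_samples distinct row indices (hypothetical selection strategy)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), n_samples))


total_samples = sum(CALIBRATION_MIX.values())
print(total_samples)  # 2048

# Example: draw 512 rows from a dataset assumed to contain 20,000 rows.
indices = pick_indices(20_000, CALIBRATION_MIX["HuggingFaceH4/CodeAlpaca_20K"])
print(len(indices), len(set(indices)))  # 512 512
```

Fixing the seed keeps the calibration set reproducible across quantization runs.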

Why Code-Optimized Calibration?

Most FP8 quantizations use generic chat data for calibration. TevunahAi uses 2,048 samples from 4 code-focused datasets, ensuring:

  • ✅ Superior code generation quality
  • ✅ Better handling of programming syntax
  • ✅ Optimized for multiple languages
  • ✅ Accurate completion of complex code
  • ✅ Production-grade reliability for coding tasks

For code models, generic calibration isn't enough. TevunahAi uses code-specific data.
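
To see why the calibration data matters, consider a simplified model of static FP8 quantization: a per-tensor scale is fixed from the largest magnitude seen during calibration, and values are then rounded to the E4M3 grid. The sketch below is a pure-Python illustration under that assumption. It is not llm-compressor's implementation, and the card does not state which observer or scale granularity was used.

```python
import math

E4M3_MAX = 448.0   # largest finite value in FP8 E4M3 (the "fn" variant)
E4M3_MIN_EXP = -6  # exponent of the smallest normal number
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest representable E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exponent = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exponent - MANTISSA_BITS)  # grid spacing in this binade
    return sign * min(round(mag / step) * step, E4M3_MAX)


def fake_quantize(values, calib_amax):
    """Quantize then dequantize with a static per-tensor scale."""
    scale = calib_amax / E4M3_MAX
    return [round_to_e4m3(v / scale) * scale for v in values]


activations = [0.02, -1.5, 3.0, 12.0]
good = fake_quantize(activations, calib_amax=12.0)  # calibration saw the real range
bad = fake_quantize(activations, calib_amax=3.0)    # calibration missed the outlier
print(good)
print(bad)  # the 12.0 entry saturates at roughly 3.0
```

If the calibration set never produces the activation ranges that code prompts trigger, the fixed scale clips those values at inference time, which is the failure mode that code-specific calibration data is meant to reduce.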

Quantization Details

  • Target Layers: All Linear layers except lm_head
  • Precision: FP8 (E4M3 format)
  • Hardware Requirements: NVIDIA Ada Lovelace or Hopper (native FP8) or Ampere with emulation
  • VRAM Usage: ~20GB for weights (fits on RTX 4090, A100, or 2x RTX 4080; the KV cache needs additional headroom)
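
The target-layer settings above map onto an llm-compressor recipe roughly as follows. This is a hedged sketch rather than the exact recipe used for this model: the `"FP8"` scheme name is an assumption (the card only says FP8 E4M3 with calibration), and `RECIPE_ARGS` and `build_recipe` are hypothetical names.

```python
# Arguments mirroring the bullets above: all Linear layers, FP8, lm_head skipped.
RECIPE_ARGS = {
    "targets": "Linear",
    "scheme": "FP8",        # assumed scheme name; the card does not specify it
    "ignore": ["lm_head"],  # keep the output head in higher precision
}


def build_recipe():
    """Build the quantization modifier; requires `pip install llmcompressor`."""
    # Imported lazily so the settings can be inspected without the library.
    from llmcompressor.modifiers.quantization import QuantizationModifier

    return QuantizationModifier(**RECIPE_ARGS)


print(RECIPE_ARGS)
```

In a full run, the object returned by `build_recipe()` would be passed to llm-compressor's `oneshot` entry point together with the base model and the calibration samples.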

Quantization Infrastructure

Quantized on professional hardware optimized for high-quality model compression:

  • CPUs: Dual Intel Xeon Max 9480 (224 threads, 128GB HBM2e @ 2000 GB/s)
  • Memory: 256GB DDR5-4800 (16 DIMMs, 8-channel per socket, ~614 GB/s)
  • Total Memory Bandwidth: ~2,614 GB/s aggregate
  • GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM) with native FP8 support
  • Software: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13 | llm-compressor
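
The bandwidth figures above follow from simple arithmetic. The sketch below reproduces them, assuming a 64-bit (8-byte) data bus per DDR5 channel, 1 GB = 10^9 bytes, and that the 2000 GB/s HBM2e figure is the aggregate across both sockets. The helper name is hypothetical.

```python
def ddr_bandwidth_gbs(mt_per_s: int, channels: int, sockets: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DDR bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mt_per_s * 1e6 * bus_bytes * channels * sockets / 1e9


ddr5 = ddr_bandwidth_gbs(4800, channels=8, sockets=2)  # DDR5-4800, 8 channels per socket
hbm = 2000.0                                           # HBM2e figure quoted in the list
total = ddr5 + hbm

print(f"DDR5:  {ddr5:.1f} GB/s")   # 614.4
print(f"Total: {total:.1f} GB/s")  # 2614.4
```

These are theoretical peaks; sustained bandwidth in practice is lower.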

Performance Notes

  • Quantization time: 124.8 minutes with premium 2048-sample calibration
  • Memory reduction: 40GB FP16 → ~20GB FP8 (50% reduction)
  • Inference speed: 2-3x faster on Ada Lovelace GPUs vs FP16
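
The memory figures are a direct function of parameter count and bytes per weight. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting weights only; the helper name is hypothetical.

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB: (params * 1e9 * bytes) / 1e9."""
    return params_billion * bytes_per_param


fp16 = weight_gb(20, 2)  # 2 bytes per weight
fp8 = weight_gb(20, 1)   # 1 byte per weight
reduction = 1 - fp8 / fp16

print(f"FP16: {fp16:.0f} GB")          # 40 GB
print(f"FP8:  {fp8:.0f} GB")           # 20 GB
print(f"Reduction: {reduction:.0%}")   # 50%
```

The real checkpoint is slightly larger than the FP8 estimate because lm_head is excluded from quantization, and runtime usage also includes the KV cache and activations.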

About IBM Granite Code

Granite-20B-Code is IBM's mid-size enterprise-grade code model, featuring:

  • Strong code generation across 100+ programming languages
  • Optimized for enterprise coding tasks
  • 8K context window (2x the 4K window of the 8B model)
  • Excellent balance of capability and efficiency
  • Apache 2.0 license

IBM Granite Code Family

TevunahAi provides premium FP8 quantizations for the IBM Granite Code family:

| Model | Parameters | Context | Quantization Time | VRAM Usage |
|---|---|---|---|---|
| granite-8b-code-instruct-4k-2048-Calibration-FP8 | 8B | 4K | 55.8 min | ~8GB |
| granite-20b-code-instruct-8k-2048-Calibration-FP8 (this) | 20B | 8K | 124.8 min | ~20GB |
| granite-34b-code-instruct-8k-2048-Calibration-FP8 | 34B | 8K | Coming soon | ~34GB |

All models calibrated with identical premium 2048-sample code-focused datasets.

Comparison: Standard vs Premium Calibration

TevunahAi offers two quantization tiers for this model:

| Version | Calibration | Samples | Datasets | Use Case |
|---|---|---|---|---|
| Standard FP8 | Basic | 256 | 1 | Quick deployment |
| Premium FP8 (this) | Code-optimized | 2,048 | 4 code-focused | Production-grade |

When to Choose Premium:

  • ✅ Production deployments
  • ✅ Quality-critical applications
  • ✅ API services at scale
  • ✅ Benchmarking and evaluation

When Standard is Fine:

  • ✅ Quick testing
  • ✅ Development/prototyping
  • ✅ Resource-constrained environments
  • ✅ Non-critical applications

License

Apache 2.0 (same as original model)

Credits


Why TevunahAi 2048-Calibration FP8?

Task-Optimized Calibration

TevunahAi doesn't use one-size-fits-all calibration:

| Model Type | Calibration Focus |
|---|---|
| Code Models | Code-specific datasets (CodeAlpaca, evol-codealpaca) |
| General Models | Diverse instruction datasets (UltraChat, SlimOrca) |

The right calibration for the right model.

The Difference is in the Details

| Aspect | Standard FP8 | TevunahAi 2048-Calibration FP8 |
|---|---|---|
| Calibration Samples | 128-512 | 2,048 |
| Datasets | Single generic | 4 code-focused |
| Edge Case Handling | Adequate | Superior |
| Code Quality | Good | Excellent |
| Production Ready | Maybe | Absolutely |

Professional Infrastructure

  • 2.6 TB/s aggregate memory bandwidth
  • 2,048 samples across 4 code-focused datasets
  • Quality-first approach over speed
  • Enterprise-ready results