# gpt-oss-120b-1024-Calibration-FP8

This is a premium FP8-quantized version of openai/gpt-oss-120b, featuring rigorous multi-dataset calibration for production-grade reliability.
## Model Description
| Property | Value |
|---|---|
| Base Model | gpt-oss-120b |
| Architecture | Mixture of Experts (120B total, 5B active per token) |
| Quantization | FP8 (E4M3 format) via llm-compressor |
| Target Hardware | NVIDIA Ada Lovelace & Hopper GPUs |
| Quantization Time | 78.7 minutes (~1.3 hours) |
| Calibration Samples | 1,024 (premium multi-dataset) |
## Usage

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/gpt-oss-120b-1024-Calibration-FP8",
    torch_dtype="auto",  # FP8 scheme is picked up from the checkpoint's quantization_config
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/gpt-oss-120b-1024-Calibration-FP8")

# Generate
messages = [{"role": "user", "content": "Explain quantum computing"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
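Note: depending on your Transformers version, loading this compressed-tensors checkpoint may also require the `compressed-tensors` package (`pip install compressed-tensors`); the FP8 scheme is then applied automatically from the model's `quantization_config`.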
### With vLLM (Recommended for production)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="TevunahAi/gpt-oss-120b-1024-Calibration-FP8")
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
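For production serving, the same checkpoint can also be exposed as an OpenAI-compatible endpoint with vLLM's built-in server, e.g. `vllm serve TevunahAi/gpt-oss-120b-1024-Calibration-FP8`.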
## Premium Calibration

This model was quantized using TevunahAi's premium multi-dataset calibration process.

### Calibration Details
- Total Samples: 1,024 (4x industry standard)
- Datasets Used: 4 complementary sources
- Coverage: STEM reasoning, conversation, instruction following, and general tasks
| Dataset | Samples | Purpose |
|---|---|---|
| Open-Platypus | 256 | STEM reasoning and logic |
| UltraChat-200k | 256 | Natural conversations |
| OpenHermes-2.5 | 256 | Instruction following |
| SlimOrca | 256 | Diverse general tasks |
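The exact calibration script is not published with this card, but a minimal sketch of the multi-dataset approach with llm-compressor might look like the following. The Hugging Face dataset ids, split names, the `to_text()` normalizer, sequence length, and output path are illustrative assumptions; only the four dataset names, the per-set counts, and the FP8 scheme match the card.

```python
# Sketch of a multi-dataset FP8 calibration run with llm-compressor.
from datasets import concatenate_datasets, load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

SAMPLES_PER_SET = 256  # 4 sources x 256 = 1,024 calibration samples

SOURCES = [  # assumed Hugging Face hub ids for the datasets in the table above
    ("garage-bAInd/Open-Platypus", "train"),
    ("HuggingFaceH4/ultrachat_200k", "train_sft"),
    ("teknium/OpenHermes-2.5", "train"),
    ("Open-Orca/SlimOrca", "train"),
]

def to_text(example):
    # Hypothetical normalizer: the four sources use different schemas,
    # so flatten each record into a single "text" column for calibration.
    return {"text": str(example)}

slices = []
for name, split in SOURCES:
    ds = load_dataset(name, split=f"{split}[:{SAMPLES_PER_SET}]")
    slices.append(ds.map(to_text, remove_columns=ds.column_names))
calibration = concatenate_datasets(slices).shuffle(seed=42)

# Static FP8 (E4M3) on all Linear layers except lm_head, as described below.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model="openai/gpt-oss-120b",
    dataset=calibration,
    recipe=recipe,
    max_seq_length=2048,  # assumed; not stated on the card
    num_calibration_samples=1024,
    output_dir="gpt-oss-120b-1024-Calibration-FP8",
)
```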
### Why Premium Calibration?
Most FP8 quantizations use 128-512 samples from a single dataset. TevunahAi uses 1,024 samples across 4 diverse datasets, ensuring:
- ✅ Superior robustness across task types
- ✅ Better statistical coverage for quantization scales
- ✅ Minimal quality loss compared to FP16
- ✅ Production-grade reliability
- ✅ Consistent performance on edge cases
When quality matters, choose TevunahAi Calibration FP8 quantizations.
## Model Architecture
GPT-OSS-120B uses a Mixture of Experts (MoE) architecture:
| Property | Value |
|---|---|
| Total Parameters | 120B |
| Active Parameters | 5B per token |
| Architecture | MoE (Mixture of Experts) |
| Benefit | 120B capability with 5B inference cost |
### Why MoE?
- Inference speed of a ~5B model
- Capability of a 120B model
- Optimal memory/performance trade-off
- Efficient expert routing
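The efficiency claim comes from sparse activation: a learned router picks a small subset of experts per token, so only those experts' parameters participate in each forward pass. Below is a toy PyTorch sketch of top-k expert routing to make the idea concrete; it is illustrative only, not GPT-OSS's actual router.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    """Toy MoE layer: only top_k of num_experts run for each token."""
    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # mixing weights for chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique():  # run only the selected experts
                mask = idx[:, k] == e
                out[mask] += weights[mask, k, None] * self.experts[int(e)](x[mask])
        return out

moe = ToyTopKMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64]); only 2 of 8 experts ran per token
```

Scaled up, the same pattern lets a model hold 120B parameters while each token pays only the compute cost of the ~5B parameters its router selects.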
## Quantization Details
- Target Layers: All Linear layers except lm_head
- Precision: FP8 (E4M3 format)
- Hardware Requirements: NVIDIA Ada Lovelace or Hopper (native FP8) or Ampere with emulation
- VRAM Usage: ~60GB (fits on a single A100 80GB or H100 80GB)
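To verify the scheme before downloading the full weights, the quantization settings are recorded in the checkpoint's config; a quick check (field contents follow the compressed-tensors convention and may vary by version):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("TevunahAi/gpt-oss-120b-1024-Calibration-FP8")
# compressed-tensors checkpoints record the scheme, target layers, and
# ignored modules (e.g. lm_head) under quantization_config in config.json.
print(config.quantization_config)
```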
## Quantization Infrastructure
Quantized on professional hardware optimized for high-quality model compression:
- CPUs: Dual Intel Xeon Max 9480 (224 threads, 128GB HBM2e @ 2000 GB/s)
- Memory: 256GB DDR5-4800 (16 DIMMs, 8-channel per socket, ~614 GB/s)
- Total Memory Bandwidth: ~2,614 GB/s aggregate
- Peak Memory Usage: ~310GB during quantization
- GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM) with native FP8 support
- Software: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13 | llm-compressor
This infrastructure enables rigorous multi-dataset calibration of 100B+ parameter models that would be impossible on standard hardware.
## Performance Notes
- Quantization time: 78.7 minutes with premium 1024-sample calibration
- Memory during quantization: ~310GB (model + calibration datasets)
- Memory reduction: 240GB FP16 → ~120GB FP8 (50% reduction)
- Inference speed: 2-3x faster on Ada Lovelace GPUs vs FP16
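These figures follow directly from bytes per parameter; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope weight memory for a 120B-parameter model.
params = 120e9
fp16_gb = params * 2 / 1e9  # 2 bytes per parameter -> 240 GB
fp8_gb = params * 1 / 1e9   # 1 byte per parameter  -> 120 GB
print(f"FP16: {fp16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB "
      f"({1 - fp8_gb / fp16_gb:.0%} reduction)")
```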
## About GPT-OSS

GPT-OSS-120B is OpenAI's flagship open-weight model release, featuring:
- State-of-the-art performance across benchmarks
- Efficient MoE architecture (120B total, 5B active)
- Strong reasoning and instruction following
- Apache 2.0 license
## License

Apache 2.0 (same as original model)
## Credits
- Original model by OpenAI
- Quantized by TevunahAi
- Quantization powered by llm-compressor
## Why TevunahAi Calibration FP8?

### The Difference is in the Details
| Aspect | Standard FP8 | TevunahAi 1024-Calibration FP8 |
|---|---|---|
| Calibration Samples | 128-256 | 1,024 |
| Datasets | Single | 4 diverse |
| Edge Case Handling | Adequate | Superior |
| Output Consistency | Good | Excellent |
| Production Ready | Maybe | Absolutely |
### Professional Infrastructure
- 2.6 TB/s aggregate memory bandwidth
- 1,024 samples across 4 complementary datasets
- Quality-first approach over speed
- Enterprise-ready results
### Pushing the Limits
This 120B MoE model required ~310GB of RAM during quantization — pushing our professional hardware to its limits. This quantization would be impossible on consumer hardware.
As of 11/26/2025 at 13:58, all FP8 model files are uploaded.