TevunahAi - תְּבוּנָה
Hebrew for deep insight and understanding.
The ability to assimilate ideas and make practical use of them. Taking complexity and turning it into clarity—preserving what matters while making it accessible.
What We Do
TevunahAi specializes in professional-grade model quantization optimized for production deployment. We offer two tiers:
FP8 Quantization (Universal Compatibility)
- ~50% memory reduction from FP16
- 2-3x faster inference on NVIDIA GPUs
- 98-99% quality retention
- Works with standard transformers library
- Ideal for: RTX 40xx, RTX 5000/6000 Ada, L40S, H100
Ultra Hybrid Quantization (Maximum Compression)
- Mixed-precision: INT4 + INT8 + FP8
- 60-70% memory reduction from FP16
- 37% smaller than FP8 with equivalent quality
- 98-99% quality retention through smart layer allocation
- Requires vLLM for inference (cutting-edge tech)
- Ideal for: Fitting larger models on consumer/professional GPUs
Ultra Hybrid: Breaking New Ground
The Problem We Solved
Granite models are produced by IBM, one of the most trusted names in enterprise AI. They can run locally on a personal workstation, be embedded in custom applications, or be integrated into IDEs. But there's a challenge: bigger models mean bigger memory requirements.
The Granite-34B code model is a perfect example. Even at FP8 quantization, it requires ~35GB of VRAM. Most professional GPUs max out at 32GB, making this an extremely tight fit, if it fits at all.
The usual solutions?
- Pure INT4 quantization → Severe quality loss for professional use
- MXFP4 (like OpenAI's new models) → Noticeable degradation on larger models
- Accept the limitation → Can't run the model you need
Our Solution: Multi-Level Quantization
Instead of forcing a single precision across the entire model, we asked: "What if different layers need different precision?"
Our Ultra Hybrid approach uses strategic precision allocation:
| Layer Type | Precision | Why |
|---|---|---|
| Critical layers (first/last attention) | FP8 | Foundation and output need precision |
| Bulk processing (middle attention) | INT8 | Balanced performance |
| Feed-forward networks (~67% of params) | INT4 | Massive savings, minimal impact |
| Embeddings & norms | FP16 | Always preserved |
We combine this strategic approach with 2048-sample calibration across four high-quality datasets to minimize quality loss during quantization.
The result? Granite-34B goes from 35GB (FP8) → 21.8GB (Ultra Hybrid) while maintaining near-identical quality.
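To make the allocation concrete, here is a minimal sketch (plain Python, not our actual pipeline) of what a layer-to-precision policy and the resulting size budget might look like; the layer-name patterns and parameter fractions are illustrative assumptions:

```python
# Illustrative sketch only: a layer-name -> precision policy mirroring the table
# above, plus a back-of-envelope size estimate. The name patterns and the
# parameter split are assumptions, not the exact recipe behind the released checkpoints.

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}

def precision_for(layer_name: str, block_idx: int, num_blocks: int) -> str:
    """Assign a precision tier to a weight tensor by name and block position."""
    if "embed" in layer_name or "norm" in layer_name:
        return "FP16"  # embeddings & norms always preserved
    if "mlp" in layer_name or "feed_forward" in layer_name:
        return "INT4"  # bulk of the parameters, biggest savings
    if block_idx < 2 or block_idx >= num_blocks - 2:
        return "FP8"   # critical first/last attention blocks
    return "INT8"      # middle attention blocks

# Rough weight-size estimate for a 34B model, assuming ~67% of parameters
# land in INT4 feed-forward layers (fractions below are illustrative).
fractions = {"INT4": 0.67, "INT8": 0.28, "FP8": 0.03, "FP16": 0.02}
total_params = 34e9
size_gb = sum(total_params * f * BYTES_PER_PARAM[p] for p, f in fractions.items()) / 1e9
print(f"Estimated weight size: {size_gb:.1f} GB")  # ~23 GB, same ballpark as 21.8 GB
```

The real split varies per architecture, and scales and weight packing shift the number, so the published 21.8GB figure differs somewhat from this naive estimate.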
Real-World Impact
On a 32GB GPU like the NVIDIA RTX 5000 Ada:
- ✅ Model loads comfortably with ~10GB VRAM headroom
- ✅ Generation speed: 20+ tokens/sec (excellent for 34B!)
- ✅ Quality: Production-ready code generation
- ✅ Verified performance on actual hardware
This gives individuals with consumer or professional GPUs the opportunity to run high-quality models without the quality loss that typically comes with aggressive compression.
More Ultra Hybrid models coming soon based on this proven approach.
The TevunahAi Difference
Most quantizations use 256 calibration samples. We use 2048+ diverse samples across multiple high-quality datasets, resulting in:
- ✅ More accurate quantization ranges
- ✅ Better representation of diverse use cases
- ✅ Reduced outlier effects
- ✅ Production-ready quality
Calibration matters. A well-calibrated INT4 layer can outperform a poorly-calibrated FP8 layer. Our 2048-sample approach ensures every precision level performs at its best.
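As a toy illustration (not our production pipeline) of why range selection matters: a single outlier can stretch a naively chosen INT4 scale until most weights collapse into a few buckets, while a calibrated, outlier-clipped range preserves far more resolution.

```python
import numpy as np

# Toy example: one outlier ruins a naive symmetric INT4 range.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)
w[0] = 1.0  # a single outlier weight

def int4_roundtrip(x, clip):
    """Quantize/dequantize with a symmetric range of +/- clip (simplified INT4)."""
    scale = clip / 7.0
    q = np.clip(np.round(x / scale), -7, 7)
    return q * scale

naive = int4_roundtrip(w, np.abs(w).max())                     # range set by the outlier
calibrated = int4_roundtrip(w, np.quantile(np.abs(w), 0.999))  # clipped range from "calibration"

print("naive MSE:      ", np.mean((w - naive) ** 2))
print("calibrated MSE: ", np.mean((w - calibrated) ** 2))  # noticeably lower: the bulk of the weights keep their resolution
```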
Model Collection
Ultra Hybrid Models (New!)
Maximum compression with professional-grade quality. Requires vLLM.
| Model | Params | Size | vs FP16 | vs FP8 | VRAM | Status |
|---|---|---|---|---|---|---|
| granite-34b-code-instruct-8k-Ultra-Hybrid | 34B | 21.8GB | -68% | -37% | 20.4GB | ✅ Verified |
| Llama-3.1-70B-Instruct-Ultra-Hybrid | 70B | 45.4GB | -68% | -37% | 48GB | ✅ Verified |
FP8 Code Models
Optimized for code generation. Uses code-specific calibration datasets.
| Model | Params | Size | Base Model |
|---|---|---|---|
| granite-34b-code-instruct-8k-FP8 | 34B | 34.7GB | IBM Granite Code 34B |
| granite-20b-code-instruct-8k-FP8 | 20B | ~20GB | IBM Granite Code 20B |
| granite-8b-code-instruct-4k-FP8 | 8B | ~8GB | IBM Granite Code 8B |
| NextCoder-32B-FP8 | 32B | ~32GB | Microsoft NextCoder 32B |
| NextCoder-14B-FP8 | 14B | ~14GB | Microsoft NextCoder 14B |
| NextCoder-7B-FP8 | 7B | ~7GB | Microsoft NextCoder 7B |
Code Calibration Datasets (2048 total samples; a loading sketch follows the list):
- CodeAlpaca-20K (512 samples) - Code instructions
- Open-Platypus (512 samples) - STEM reasoning
- OpenHermes-2.5 (512 samples) - Instruction following
- evol-codealpaca-v1 (512 samples) - Evolved code tasks
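A minimal sketch of how such a blend can be assembled with the Hugging Face datasets library; the repository IDs below are plausible public sources and the sampling is simplified, so treat both as assumptions rather than our exact preprocessing:

```python
from datasets import load_dataset

# Illustrative only: build a 4 x 512 = 2048-sample calibration blend.
# Repo IDs are assumed public mirrors of the datasets named above.
SOURCES = [
    "sahil2801/CodeAlpaca-20k",
    "garage-bAInd/Open-Platypus",
    "teknium/OpenHermes-2.5",
    "theblackcat102/evol-codealpaca-v1",
]
SAMPLES_PER_SOURCE = 512

calibration_splits = []
for repo_id in SOURCES:
    ds = load_dataset(repo_id, split="train").shuffle(seed=42)
    calibration_splits.append(ds.select(range(SAMPLES_PER_SOURCE)))

print(sum(len(d) for d in calibration_splits), "calibration samples")  # 2048
```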
FP8 General-Purpose Models
Optimized for reasoning, chat, and instruction following.
| Model | Params | Size | Base Model |
|---|---|---|---|
| gpt-oss-120b-FP8 | 120B | ~120GB | OpenAI GPT-OSS 120B* |
| Qwen3-Next-80B-FP8 | 80B | ~80GB | Qwen3-Next 80B |
| Apertus-70B-Instruct-FP8 | 70B | ~70GB | Apertus 70B |
| gpt-oss-20b-FP8 | 20B | ~20GB | OpenAI GPT-OSS 20B |
| Apertus-8B-Instruct-FP8 | 8B | ~8GB | Apertus 8B |
General Calibration Datasets (2048 total samples):
- Open-Platypus (512 samples) - STEM reasoning
- UltraChat-200k (512 samples) - Natural conversations
- OpenHermes-2.5 (512 samples) - Instruction following
- SlimOrca (512 samples) - Diverse tasks
*The 120B model uses 1024 calibration samples due to memory constraints during quantization
Usage Examples
Ultra Hybrid Models (vLLM Required)
```python
from vllm import LLM, SamplingParams

# Load Ultra Hybrid model
llm = LLM("TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid")

# Configure generation
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=200,
)

# Generate
prompt = "Write a Python function to calculate fibonacci numbers:"
outputs = llm.generate(prompt, params)
print(outputs[0].outputs[0].text)
```
OpenAI-Compatible API Server:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid \
    --host 0.0.0.0 \
    --port 8000
```
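Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch with the official openai Python client (the api_key value is a placeholder, since vLLM does not require one by default):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.completions.create(
    model="TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid",
    prompt="Write a Python function to calculate fibonacci numbers:",
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].text)
```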
Why vLLM? Ultra Hybrid uses mixed-precision quantization (INT4+INT8+FP8), which is cutting-edge technology. The transformers library doesn't yet support this, but vLLM does—with optimized inference kernels (MarlinLinear for INT4, CutlassScaledMM for INT8) for maximum speed.
FP8 Models (Standard Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load FP8 model ("auto" lets the checkpoint's quantization config choose dtypes)
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/granite-20b-code-instruct-8k-FP8",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/granite-20b-code-instruct-8k-FP8"
)

# Generate
prompt = "Write a function to reverse a string:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Quantization Infrastructure
Professional hardware for production-quality results:
Compute
CPUs: Dual Intel Xeon Max 9480
- 224 threads (112 physical cores × 2 sockets)
- 128GB HBM2e memory (high-bandwidth on-package)
- AMX (Advanced Matrix Extensions) for accelerated tensor operations
- Optimized with SPREAD thread affinity for dual-socket memory bandwidth
GPU: NVIDIA RTX 5000 Ada Generation
- 32GB GDDR6 VRAM
- Native FP8 Tensor Cores (4th gen)
- TF32 precision enabled for optimal performance (see the configuration sketch below)
- PCIe Gen 4 interface
Memory: 384GB total system memory
- 256GB DDR5-4800 RAM
- 128GB HBM2e (on-package with CPUs)
- 1.8TB swap as overflow headroom during large-model quantization
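For reference, the SPREAD thread affinity and TF32 settings above are standard OpenMP/PyTorch knobs rather than anything exotic; a minimal sketch of this kind of configuration (exact values are environment-specific assumptions):

```python
import os

# Spread OpenMP threads across both sockets; must be set before the OpenMP
# runtime starts (i.e. before importing torch in this process).
os.environ.setdefault("OMP_PROC_BIND", "spread")
os.environ.setdefault("OMP_PLACES", "cores")

import torch

# Allow TF32 on Ada-generation tensor cores for float32 matmuls/convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.set_float32_matmul_precision("high")
```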
Software Stack
- OS: Ubuntu 25.10
- Python: 3.12
- PyTorch: 2.9+ with CUDA 12.8
- Quantization: Neural Magic llmcompressor with compressed-tensors
- Inference: vLLM 0.11+ (Ultra Hybrid), Transformers 4.40+ (FP8)
This infrastructure enables quantization of models up to 120B+ parameters with rigorous 2048-sample calibration and validation.
Hardware Requirements
For FP8 Models
GPU Requirements: NVIDIA GPU with native FP8 Tensor Core support:
- Consumer: RTX 4090, RTX 4080, RTX 4070 Ti
- Professional: RTX 5000 Ada, RTX 6000 Ada, L40S, L4
- Datacenter: H100, H200, GH200 Grace Hopper
Software:
- PyTorch 2.1+
- Transformers 4.40+
- CUDA 11.8+
For Ultra Hybrid Models
Same GPU requirements as FP8, but significantly lower VRAM needs:
- Granite-34B Ultra Hybrid: 24GB VRAM (vs 35GB for FP8)
Software:
- vLLM 0.11+
- PyTorch 2.1+
- CUDA 11.8+
Installation:
```bash
pip install vllm
```
Who Should Use Our Models?
FP8 Models Are Ideal For:
- ✅ Production deployments needing universal compatibility
- ✅ Standard transformers workflows
- ✅ Users with Ada Lovelace or Hopper GPUs
- ✅ Applications requiring proven stability
- ✅ Teams preferring established tooling
Ultra Hybrid Models Are Ideal For:
- ✅ Fitting larger models on smaller GPUs (34B on 24GB!)
- ✅ Maximizing GPU utilization in multi-model setups
- ✅ Cost-sensitive deployments (smaller GPUs = lower hardware costs)
- ✅ Users comfortable with vLLM (industry-standard for high-performance inference)
- ✅ Cutting-edge deployments wanting maximum compression
- ✅ Teams willing to adopt newer inference technologies
Quality Guarantee
Every quantization we release:
✅ Professional calibration: 2048+ samples (8x industry standard)
✅ Quality verified: Tested for coherence, accuracy, and task performance
✅ Enterprise hardware: Quantized on professional-grade infrastructure
✅ Complete documentation: Usage examples, specifications, and performance metrics
✅ License preservation: Inherits and respects original model's license
✅ Industry-standard tools: Neural Magic llmcompressor, vLLM, Transformers
✅ Reproducible process: Documented calibration datasets and methodology
We don't just quantize models—we validate them for real-world use.
Why "TevunahAi"?
In Hebrew, תְּבוּנָה (Tevunah) means deep understanding—not just knowledge, but wisdom integrated and applied throughout life. It's about taking complexity and finding clarity, about seeing the essence of things.
We chose this name because quantization is fundamentally about understanding what makes a model work and preserving that essence in a more efficient form.
Just as Tevunah represents turning chaos into order, our quantizations turn massive models into efficient tools without losing their fundamental capabilities.
The name reminds us: compression without understanding is just data loss. Compression with understanding is optimization.
Connect
- 🤗 HuggingFace: TevunahAi
- 📦 Browse Models: View Collection
- 💬 Discussions: We're open to feedback, requests, and collaboration!
- 📧 Contact: For enterprise inquiries and custom quantization services
Acknowledgments
We're grateful to:
- Neural Magic - For llmcompressor, the foundation of our quantization pipeline
- Microsoft, IBM, Qwen Team, OpenAI - For creating exceptional base models and releasing them openly
- vLLM Team - For enabling cutting-edge mixed-precision inference and pushing the boundaries
- NVIDIA - For building hardware that makes efficient AI practical and accessible
- The open-source AI community - For collaboration, feedback, and driving innovation forward, and for giving individuals an open place to contribute, learn, and give back.