
TevunahAi - תְּבוּנָה

Hebrew for deep insight and understanding.

The ability to assimilate ideas and make practical use of them. Taking complexity and turning it into clarity—preserving what matters while making it accessible.


What We Do

TevunahAi specializes in professional-grade model quantization optimized for production deployment. We offer two tiers:

FP8 Quantization (Universal Compatibility)

  • ~50% memory reduction from FP16
  • 2-3x faster inference on NVIDIA GPUs
  • 98-99% quality retention
  • Works with standard transformers library
  • Ideal for: RTX 40xx, RTX 5000/6000 Ada, L40S, H100

Ultra Hybrid Quantization (Maximum Compression)

  • Mixed-precision: INT4 + INT8 + FP8
  • 60-70% memory reduction from FP16
  • 37% smaller than FP8 with equivalent quality
  • 98-99% quality retention through smart layer allocation
  • Requires vLLM for inference (cutting-edge tech)
  • Ideal for: Fitting larger models on consumer/professional GPUs

Ultra Hybrid: Breaking New Ground

The Problem We Solved

Granite models are produced by IBM, one of the most trusted names in enterprise AI. They can run locally on personal computers, inside custom applications, or integrated into IDEs. The challenge: the bigger the model, the more VRAM it needs.

The Granite-34B code model is a perfect example. Even at FP8 quantization, it requires ~35GB of VRAM. Most professional GPUs max out at 32GB, making this an extremely tight fit, if it fits at all.

The usual solutions?

  • Pure INT4 quantization → Severe quality loss for professional use
  • MXFP4 (like OpenAI's new models) → Noticeable degradation on larger models
  • Accept the limitation → Can't run the model you need

Our Solution: Multi-Level Quantization

Instead of forcing a single precision across the entire model, we asked: "What if different layers need different precision?"

Our Ultra Hybrid approach uses strategic precision allocation:

| Layer Type | Precision | Why |
|---|---|---|
| Critical layers (first/last attention) | FP8 | Foundation and output need precision |
| Bulk processing (middle attention) | INT8 | Balanced performance |
| Feed-forward networks (~67% of params) | INT4 | Massive savings, minimal impact |
| Embeddings & norms | FP16 | Always preserved |
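
To make the allocation concrete, here is a minimal sketch of the selection logic implied by the table above. The function name, thresholds, and Llama-style module names are illustrative assumptions, not our production recipe:

# Illustrative only: map a layer to a target precision following the table
# above. Module names and thresholds are assumptions, not the exact recipe
# used for the released checkpoints.
def assign_precision(layer_name: str, layer_idx: int, num_layers: int) -> str:
    if any(k in layer_name for k in ("embed", "norm", "lm_head")):
        return "FP16"  # embeddings and norms are always preserved
    if "mlp" in layer_name or "feed_forward" in layer_name:
        return "INT4"  # feed-forward networks hold the bulk of the parameters
    if layer_idx == 0 or layer_idx == num_layers - 1:
        return "FP8"   # first/last blocks keep FP8 precision
    return "INT8"      # remaining attention layers

for idx, name in [
    (0, "model.layers.0.self_attn.q_proj"),
    (24, "model.layers.24.self_attn.q_proj"),
    (24, "model.layers.24.mlp.gate_proj"),
    (47, "model.layers.47.self_attn.o_proj"),
]:
    print(f"{name} -> {assign_precision(name, idx, num_layers=48)}")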

We combine this strategic approach with 2048-sample calibration across four high-quality datasets to minimize quality loss during quantization.

The result? Granite-34B goes from 35GB (FP8) → 21.8GB (Ultra Hybrid) while maintaining near-identical quality.
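
A quick back-of-envelope check of those numbers, counting only weight storage (quantization scales, FP16 embeddings, and the KV cache add a little on top); the 67%/33% parameter split comes from the allocation table above:

params = 34e9  # Granite-34B parameter count (approximate)

bytes_fp16 = params * 2.0                          # 2 bytes per weight
bytes_fp8 = params * 1.0                           # 1 byte per weight
bytes_hybrid = params * (0.67 * 0.5 + 0.33 * 1.0)  # INT4 FFNs, INT8/FP8 elsewhere

print(f"FP16:         ~{bytes_fp16 / 1e9:.0f} GB")
print(f"FP8:          ~{bytes_fp8 / 1e9:.0f} GB")
print(f"Ultra Hybrid: ~{bytes_hybrid / 1e9:.1f} GB")  # ~22.6 GB, in line with the 21.8 GB above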

Real-World Impact

On a 32GB GPU like the NVIDIA RTX 5000 Ada:

  • ✅ Model loads comfortably with ~10GB VRAM headroom
  • ✅ Generation speed: 20+ tokens/sec (excellent for 34B!)
  • ✅ Quality: Production-ready code generation
  • ✅ Verified performance on actual hardware

This gives individuals with consumer or professional GPUs the opportunity to run high-quality models without the quality loss that typically comes with aggressive compression.

More Ultra Hybrid models coming soon based on this proven approach.


The TevunahAi Difference

Most quantizations use 256 calibration samples. We use 2048+ diverse samples across multiple high-quality datasets, resulting in:

  • ✅ More accurate quantization ranges
  • ✅ Better representation of diverse use cases
  • ✅ Reduced outlier effects
  • ✅ Production-ready quality

Calibration matters. A well-calibrated INT4 layer can outperform a poorly-calibrated FP8 layer. Our 2048-sample approach ensures every precision level performs at its best.
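
As a rough illustration, assembling such a calibration set amounts to sampling a fixed number of examples from each source and concatenating them. The sketch below uses the Hugging Face datasets library; the hub IDs and field names are our assumptions (based on the code-model mix listed in the Model Collection section), not a guaranteed match for the exact pipeline:

from datasets import load_dataset

# Assumed hub IDs for the code-model calibration mix; 512 samples from each
# of four sources gives 2048 calibration texts in total.
SOURCES = {
    "sahil2801/CodeAlpaca-20k": "instruction",
    "garage-bAInd/Open-Platypus": "instruction",
    "teknium/OpenHermes-2.5": None,  # conversation-style; formatted separately
    "theblackcat102/evol-codealpaca-v1": "instruction",
}
SAMPLES_PER_SOURCE = 512

calibration_texts = []
for repo_id, text_field in SOURCES.items():
    ds = load_dataset(repo_id, split="train").shuffle(seed=42)
    for row in ds.select(range(SAMPLES_PER_SOURCE)):
        text = row.get(text_field, "") if text_field else str(row)
        calibration_texts.append(text)

print(len(calibration_texts))  # 2048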


Model Collection

Ultra Hybrid Models (New!)

Maximum compression with professional-grade quality. Requires vLLM.

| Model | Params | Size | vs FP16 | vs FP8 | VRAM | Status |
|---|---|---|---|---|---|---|
| granite-34b-code-instruct-8k-Ultra-Hybrid | 34B | 21.8GB | -68% | -37% | 20.4GB | ✅ Verified |
| Llama-3.1-70B-Instruct-Ultra-Hybrid | 70B | 45.4GB | -68% | -37% | 48GB | ✅ Verified |

FP8 Code Models

Optimized for code generation. Uses code-specific calibration datasets.

| Model | Params | Size | Base Model |
|---|---|---|---|
| granite-34b-code-instruct-8k-FP8 | 34B | 34.7GB | IBM Granite Code 34B |
| granite-20b-code-instruct-8k-FP8 | 20B | ~20GB | IBM Granite Code 20B |
| granite-8b-code-instruct-4k-FP8 | 8B | ~8GB | IBM Granite Code 8B |
| NextCoder-32B-FP8 | 32B | ~32GB | Microsoft NextCoder 32B |
| NextCoder-14B-FP8 | 14B | ~14GB | Microsoft NextCoder 14B |
| NextCoder-7B-FP8 | 7B | ~7GB | Microsoft NextCoder 7B |

Code Calibration Datasets (2048 total samples):

  • CodeAlpaca-20K (512 samples) - Code instructions
  • Open-Platypus (512 samples) - STEM reasoning
  • OpenHermes-2.5 (512 samples) - Instruction following
  • evol-codealpaca-v1 (512 samples) - Evolved code tasks

FP8 General-Purpose Models

Optimized for reasoning, chat, and instruction following.

| Model | Params | Size | Base Model |
|---|---|---|---|
| gpt-oss-120b-FP8 | 120B | ~120GB | OpenAI GPT-OSS 120B* |
| Qwen3-Next-80B-FP8 | 80B | ~80GB | Qwen3-Next 80B |
| Apertus-70B-Instruct-FP8 | 70B | ~70GB | Apertus 70B |
| gpt-oss-20b-FP8 | 20B | ~20GB | OpenAI GPT-OSS 20B |
| Apertus-8B-Instruct-FP8 | 8B | ~8GB | Apertus 8B |

General Calibration Datasets (2048 total samples):

  • Open-Platypus (512 samples) - STEM reasoning
  • UltraChat-200k (512 samples) - Natural conversations
  • OpenHermes-2.5 (512 samples) - Instruction following
  • SlimOrca (512 samples) - Diverse tasks

*The 120B model uses 1024 calibration samples due to memory constraints during quantization


Usage Examples

Ultra Hybrid Models (vLLM Required)

from vllm import LLM, SamplingParams

# Load Ultra Hybrid model
llm = LLM("TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid")

# Configure generation
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=200
)

# Generate
prompt = "Write a Python function to calculate fibonacci numbers:"
outputs = llm.generate(prompt, params)

print(outputs[0].outputs[0].text)

OpenAI-Compatible API Server:

python -m vllm.entrypoints.openai.api_server \
    --model TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid \
    --host 0.0.0.0 \
    --port 8000
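
Once the server is running, any OpenAI-compatible client can query it. A minimal sketch with the openai Python package, assuming the default localhost:8000 endpoint (vLLM does not check the API key):

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid",
    prompt="Write a Python function to calculate fibonacci numbers:",
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].text)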

Why vLLM? Ultra Hybrid uses mixed-precision quantization (INT4+INT8+FP8), which is cutting-edge technology. The transformers library doesn't yet support this, but vLLM does—with optimized inference kernels (MarlinLinear for INT4, CutlassScaledMM for INT8) for maximum speed.

FP8 Models (Standard Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load FP8 model (the checkpoint's quantization config handles the FP8 weights)
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/granite-20b-code-instruct-8k-FP8",
    torch_dtype="auto",
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/granite-20b-code-instruct-8k-FP8"
)

# Generate
prompt = "Write a function to reverse a string:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quantization Infrastructure

Professional hardware for production-quality results:

Compute

  • CPUs: Dual Intel Xeon Max 9480

    • 224 threads (112 physical cores × 2 sockets)
    • 128GB HBM2e memory (high-bandwidth on-package)
    • AMX (Advanced Matrix Extensions) for accelerated tensor operations
    • Optimized with SPREAD thread affinity for dual-socket memory bandwidth
  • GPU: NVIDIA RTX 5000 Ada Generation

    • 32GB GDDR6 VRAM
    • Native FP8 Tensor Cores (4th gen)
    • TF32 precision enabled for optimal performance
    • PCIe Gen 4 interface
  • Memory: 384GB total system memory

    • 256GB DDR5-4800 RAM
    • 128GB HBM2e (on-package with CPUs)
    • 1.8TB swap for enterprise-grade reliability

Software Stack

  • OS: Ubuntu 25.10
  • Python: 3.12
  • PyTorch: 2.9+ with CUDA 12.8
  • Quantization: Neural Magic llmcompressor with compressed-tensors
  • Inference: vLLM 0.11+ (Ultra Hybrid), Transformers 4.40+ (FP8)

This infrastructure enables quantization of models up to 120B+ parameters with rigorous 2048-sample calibration and validation.


Hardware Requirements

For FP8 Models

GPU Requirements: NVIDIA GPU with native FP8 Tensor Core support (a quick programmatic check follows below):

  • Consumer: RTX 4090, RTX 4080, RTX 4070 Ti
  • Professional: RTX 5000 Ada, RTX 6000 Ada, L40S, L4
  • Datacenter: H100, H200, GH200 Grace Hopper

Software:

  • PyTorch 2.1+
  • Transformers 4.40+
  • CUDA 11.8+
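
To confirm a GPU falls into one of the classes above, you can check its CUDA compute capability: Ada Lovelace cards report 8.9 and Hopper 9.0, both of which have native FP8 Tensor Cores. A minimal check (the 8.9 threshold is our reading of those architectures):

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    has_fp8 = (major, minor) >= (8, 9)  # 8.9 = Ada Lovelace, 9.0 = Hopper
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}, "
          f"native FP8 {'supported' if has_fp8 else 'not supported'}")
else:
    print("No CUDA GPU detected")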

For Ultra Hybrid Models

Same GPU requirements as FP8, but significantly lower VRAM needs (a configuration sketch follows below):

  • Granite-34B Ultra Hybrid: 24GB VRAM (vs 35GB for FP8)
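
When VRAM is this tight, vLLM's standard memory settings help keep the weights plus KV cache inside the budget. A minimal sketch; the values below are illustrative starting points, not tuned recommendations:

from vllm import LLM

llm = LLM(
    "TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid",
    max_model_len=8192,           # shorter context -> smaller KV cache
    gpu_memory_utilization=0.95,  # fraction of VRAM vLLM may claim
)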

Software:

  • vLLM 0.11+
  • PyTorch 2.1+
  • CUDA 11.8+

Installation:

pip install vllm

Who Should Use Our Models?

FP8 Models Are Ideal For:

  • ✅ Production deployments needing universal compatibility
  • ✅ Standard transformers workflows
  • ✅ Users with Ada Lovelace or Hopper GPUs
  • ✅ Applications requiring proven stability
  • ✅ Teams preferring established tooling

Ultra Hybrid Models Are Ideal For:

  • ✅ Fitting larger models on smaller GPUs (34B on 24GB!)
  • ✅ Maximizing GPU utilization in multi-model setups
  • ✅ Cost-sensitive deployments (smaller GPUs = lower hardware costs)
  • ✅ Users comfortable with vLLM (industry-standard for high-performance inference)
  • ✅ Cutting-edge deployments wanting maximum compression
  • ✅ Teams willing to adopt newer inference technologies

Quality Guarantee

Every quantization we release:

  • ✅ Professional calibration: 2048+ samples (8x industry standard)
  • ✅ Quality verified: Tested for coherence, accuracy, and task performance
  • ✅ Enterprise hardware: Quantized on professional-grade infrastructure
  • ✅ Complete documentation: Usage examples, specifications, and performance metrics
  • ✅ License preservation: Inherits and respects the original model's license
  • ✅ Industry-standard tools: Neural Magic llmcompressor, vLLM, Transformers
  • ✅ Reproducible process: Documented calibration datasets and methodology

We don't just quantize models—we validate them for real-world use.

Why "TevunahAi"?

In Hebrew, תְּבוּנָה (Tevunah) means deep understanding—not just knowledge, but wisdom integrated and applied throughout life. It's about taking complexity and finding clarity, about seeing the essence of things.

We chose this name because quantization is fundamentally about understanding what makes a model work and preserving that essence in a more efficient form.

Just as Tevunah represents turning chaos into order, our quantizations turn massive models into efficient tools without losing their fundamental capabilities.

The name reminds us: compression without understanding is just data loss. Compression with understanding is optimization.


Connect

  • 🤗 HuggingFace: TevunahAi
  • 📦 Browse Models: View Collection
  • 💬 Discussions: We're open to feedback, requests, and collaboration!
  • 📧 Contact: For enterprise inquiries and custom quantization services

Acknowledgments

We're grateful to:

  • Neural Magic - For llmcompressor, the foundation of our quantization pipeline
  • Microsoft, IBM, Qwen Team, OpenAI - For creating exceptional base models and releasing them openly
  • vLLM Team - For enabling cutting-edge mixed-precision inference and pushing the boundaries
  • NVIDIA - For building hardware that makes efficient AI practical and accessible
  • The open-source AI community - For collaboration, feedback, and driving innovation forward, and for giving individuals a place to engage openly, contribute, learn, and give back
