
TevunahAi - תְּבוּנָה

Hebrew for deep insight and understanding.

The ability to assimilate ideas and make practical use of them. Taking complexity and turning it into clarity—preserving what matters while making it accessible.


What We Do

TevunahAi specializes in professional-grade model quantization optimized for production deployment. We offer two tiers:

FP8 Quantization (Universal Compatibility)

  • ~50% memory reduction from FP16
  • 2-3x faster inference on NVIDIA GPUs
  • 98-99% quality retention
  • Works with standard transformers library
  • Ideal for: RTX 40xx, RTX 5000/6000 Ada, L40S, H100

Ultra Hybrid Quantization (Maximum Compression)

  • Mixed-precision: INT4 + INT8 + FP8
  • 60-70% memory reduction from FP16
  • 37% smaller than FP8 with equivalent quality
  • 98-99% quality retention through smart layer allocation
  • Requires vLLM for inference (cutting-edge tech)
  • Ideal for: Fitting larger models on consumer/professional GPUs

Ultra Hybrid: Breaking New Ground

The Problem We Solved

Granite models are produced by IBM, one of the most trusted names in enterprise AI. They can run locally on personal computers, inside custom applications, or integrated into IDEs. The challenge: the bigger the model, the more VRAM it needs.

The Granite-34B code model is a perfect example. Even at FP8 quantization, it requires ~35GB of VRAM. Most professional GPUs max out at 32GB, making this an extremely tight fit, if it fits at all.

The usual solutions?

  • Pure INT4 quantization → Severe quality loss for professional use
  • MXFP4 (like OpenAI's new models) → Noticeable degradation on larger models
  • Accept the limitation → Can't run the model you need

Our Solution: Multi-Level Quantization

Instead of forcing a single precision across the entire model, we asked: "What if different layers need different precision?"

Our Ultra Hybrid approach uses strategic precision allocation:

| Layer Type | Precision | Why |
|---|---|---|
| Critical layers (first/last attention) | FP8 | Foundation and output need precision |
| Bulk processing (middle attention) | INT8 | Balanced performance |
| Feed-forward networks (~67% of params) | INT4 | Massive savings, minimal impact |
| Embeddings & norms | FP16 | Always preserved |
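
To make the allocation concrete, here is a minimal sketch of the selection logic implied by the table above. The function name, thresholds, and Llama-style module names are illustrative assumptions, not our production recipe:

# Illustrative only: map a layer to a target precision following the table
# above. Module names and thresholds are assumptions, not the exact recipe
# used for the released checkpoints.
def assign_precision(layer_name: str, layer_idx: int, num_layers: int) -> str:
    if any(k in layer_name for k in ("embed", "norm", "lm_head")):
        return "FP16"  # embeddings and norms are always preserved
    if "mlp" in layer_name or "feed_forward" in layer_name:
        return "INT4"  # feed-forward networks hold the bulk of the parameters
    if layer_idx == 0 or layer_idx == num_layers - 1:
        return "FP8"   # first/last blocks keep FP8 precision
    return "INT8"      # remaining attention layers

for idx, name in [
    (0, "model.layers.0.self_attn.q_proj"),
    (24, "model.layers.24.self_attn.q_proj"),
    (24, "model.layers.24.mlp.gate_proj"),
    (47, "model.layers.47.self_attn.o_proj"),
]:
    print(f"{name} -> {assign_precision(name, idx, num_layers=48)}")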

We combine this strategic approach with 2048-sample calibration across four high-quality datasets to minimize quality loss during quantization.

The result? Granite-34B goes from 35GB (FP8) → 21.8GB (Ultra Hybrid) while maintaining near-identical quality.
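
A quick back-of-envelope check of those numbers, counting only weight storage (quantization scales, FP16 embeddings, and the KV cache add a little on top); the 67%/33% parameter split comes from the allocation table above:

params = 34e9  # Granite-34B parameter count (approximate)

bytes_fp16 = params * 2.0                          # 2 bytes per weight
bytes_fp8 = params * 1.0                           # 1 byte per weight
bytes_hybrid = params * (0.67 * 0.5 + 0.33 * 1.0)  # INT4 FFNs, INT8/FP8 elsewhere

print(f"FP16:         ~{bytes_fp16 / 1e9:.0f} GB")
print(f"FP8:          ~{bytes_fp8 / 1e9:.0f} GB")
print(f"Ultra Hybrid: ~{bytes_hybrid / 1e9:.1f} GB")  # ~22.6 GB, in line with the 21.8 GB above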

Real-World Impact

On a 32GB GPU like the NVIDIA RTX 5000 Ada:

  • ✅ Model loads comfortably with ~10GB VRAM headroom
  • ✅ Generation speed: 20+ tokens/sec (excellent for 34B!)
  • ✅ Quality: Production-ready code generation
  • ✅ Verified performance on actual hardware

This gives individuals with consumer or professional GPUs the opportunity to run high-quality models without the quality loss that typically comes with aggressive compression.

More Ultra Hybrid models coming soon based on this proven approach.


The TevunahAi Difference

Most quantizations use 256 calibration samples. We use 2048+ diverse samples across multiple high-quality datasets, resulting in:

  • ✅ More accurate quantization ranges
  • ✅ Better representation of diverse use cases
  • ✅ Reduced outlier effects
  • ✅ Production-ready quality

Calibration matters. A well-calibrated INT4 layer can outperform a poorly-calibrated FP8 layer. Our 2048-sample approach ensures every precision level performs at its best.
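
As a rough illustration, assembling such a calibration set amounts to sampling a fixed number of examples from each source and concatenating them. The sketch below uses the Hugging Face datasets library; the hub IDs and field names are our assumptions (based on the code-model mix listed in the Model Collection section), not a guaranteed match for the exact pipeline:

from datasets import load_dataset

# Assumed hub IDs for the code-model calibration mix; 512 samples from each
# of four sources gives 2048 calibration texts in total.
SOURCES = {
    "sahil2801/CodeAlpaca-20k": "instruction",
    "garage-bAInd/Open-Platypus": "instruction",
    "teknium/OpenHermes-2.5": None,  # conversation-style; formatted separately
    "theblackcat102/evol-codealpaca-v1": "instruction",
}
SAMPLES_PER_SOURCE = 512

calibration_texts = []
for repo_id, text_field in SOURCES.items():
    ds = load_dataset(repo_id, split="train").shuffle(seed=42)
    for row in ds.select(range(SAMPLES_PER_SOURCE)):
        text = row.get(text_field, "") if text_field else str(row)
        calibration_texts.append(text)

print(len(calibration_texts))  # 2048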


Model Collection

Ultra Hybrid Models (New!)

Maximum compression with professional-grade quality. Requires vLLM.

| Model | Params | Size | vs FP16 | vs FP8 | VRAM | Status |
|---|---|---|---|---|---|---|
| granite-34b-code-instruct-8k-Ultra-Hybrid | 34B | 21.8GB | -68% | -37% | 20.4GB | ✅ Verified |
| Llama-3.1-70B-Instruct-Ultra-Hybrid | 70B | 45.4GB | -68% | -37% | 48GB | ✅ Verified |

FP8 Code Models

Optimized for code generation. Uses code-specific calibration datasets.

| Model | Params | Size | Base Model |
|---|---|---|---|
| granite-34b-code-instruct-8k-FP8 | 34B | 34.7GB | IBM Granite Code 34B |
| granite-20b-code-instruct-8k-FP8 | 20B | ~20GB | IBM Granite Code 20B |
| granite-8b-code-instruct-4k-FP8 | 8B | ~8GB | IBM Granite Code 8B |
| NextCoder-32B-FP8 | 32B | ~32GB | Microsoft NextCoder 32B |
| NextCoder-14B-FP8 | 14B | ~14GB | Microsoft NextCoder 14B |
| NextCoder-7B-FP8 | 7B | ~7GB | Microsoft NextCoder 7B |

Code Calibration Datasets (2048 total samples):

  • CodeAlpaca-20K (512 samples) - Code instructions
  • Open-Platypus (512 samples) - STEM reasoning
  • OpenHermes-2.5 (512 samples) - Instruction following
  • evol-codealpaca-v1 (512 samples) - Evolved code tasks

FP8 General-Purpose Models

Optimized for reasoning, chat, and instruction following.

| Model | Params | Size | Base Model |
|---|---|---|---|
| gpt-oss-120b-FP8 | 120B | ~120GB | OpenAI GPT-OSS 120B* |
| Qwen3-Next-80B-FP8 | 80B | ~80GB | Qwen3-Next 80B |
| Apertus-70B-Instruct-FP8 | 70B | ~70GB | Apertus 70B |
| gpt-oss-20b-FP8 | 20B | ~20GB | OpenAI GPT-OSS 20B |
| Apertus-8B-Instruct-FP8 | 8B | ~8GB | Apertus 8B |

General Calibration Datasets (2048 total samples):

  • Open-Platypus (512 samples) - STEM reasoning
  • UltraChat-200k (512 samples) - Natural conversations
  • OpenHermes-2.5 (512 samples) - Instruction following
  • SlimOrca (512 samples) - Diverse tasks

*The 120B model uses 1024 calibration samples due to memory constraints during quantization


Usage Examples

Ultra Hybrid Models (vLLM Required)

from vllm import LLM, SamplingParams

# Load Ultra Hybrid model
llm = LLM("TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid")

# Configure generation
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=200
)

# Generate
prompt = "Write a Python function to calculate fibonacci numbers:"
outputs = llm.generate(prompt, params)

print(outputs[0].outputs[0].text)

OpenAI-Compatible API Server:

python -m vllm.entrypoints.openai.api_server \
    --model TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid \
    --host 0.0.0.0 \
    --port 8000
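
Once the server is running, any OpenAI-compatible client can query it. A minimal sketch with the openai Python package, assuming the default localhost:8000 endpoint (vLLM does not check the API key):

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid",
    prompt="Write a Python function to calculate fibonacci numbers:",
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].text)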

Why vLLM? Ultra Hybrid uses mixed-precision quantization (INT4+INT8+FP8), which is cutting-edge technology. The transformers library doesn't yet support this, but vLLM does—with optimized inference kernels (MarlinLinear for INT4, CutlassScaledMM for INT8) for maximum speed.

FP8 Models (Standard Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load FP8 model (the checkpoint's quantization config handles the FP8 weights)
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/granite-20b-code-instruct-8k-FP8",
    torch_dtype="auto",
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/granite-20b-code-instruct-8k-FP8"
)

# Generate
prompt = "Write a function to reverse a string:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quantization Infrastructure

Professional hardware for production-quality results:

Compute

  • CPUs: Dual Intel Xeon Max 9480

    • 224 threads (112 physical cores × 2 sockets)
    • 128GB HBM2e memory (high-bandwidth on-package)
    • AMX (Advanced Matrix Extensions) for accelerated tensor operations
    • Optimized with SPREAD thread affinity for dual-socket memory bandwidth
  • GPU: NVIDIA RTX 5000 Ada Generation

    • 32GB GDDR6 VRAM
    • Native FP8 Tensor Cores (4th gen)
    • TF32 precision enabled for optimal performance
    • PCIe Gen 4 interface
  • Memory: 384GB total system memory

    • 256GB DDR5-4800 RAM
    • 128GB HBM2e (on-package with CPUs)
    • 1.8TB swap for enterprise-grade reliability

Software Stack

  • OS: Ubuntu 25.10
  • Python: 3.12
  • PyTorch: 2.9+ with CUDA 12.8
  • Quantization: Neural Magic llmcompressor with compressed-tensors
  • Inference: vLLM 0.11+ (Ultra Hybrid), Transformers 4.40+ (FP8)

This infrastructure enables quantization of models up to 120B+ parameters with rigorous 2048-sample calibration and validation.


Hardware Requirements

For FP8 Models

GPU Requirements: NVIDIA GPU with native FP8 Tensor Core support (a quick programmatic check follows below):

  • Consumer: RTX 4090, RTX 4080, RTX 4070 Ti
  • Professional: RTX 5000 Ada, RTX 6000 Ada, L40S, L4
  • Datacenter: H100, H200, GH200 Grace Hopper

Software:

  • PyTorch 2.1+
  • Transformers 4.40+
  • CUDA 11.8+
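
To confirm a GPU falls into one of the classes above, you can check its CUDA compute capability: Ada Lovelace cards report 8.9 and Hopper 9.0, both of which have native FP8 Tensor Cores. A minimal check (the 8.9 threshold is our reading of those architectures):

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    has_fp8 = (major, minor) >= (8, 9)  # 8.9 = Ada Lovelace, 9.0 = Hopper
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}, "
          f"native FP8 {'supported' if has_fp8 else 'not supported'}")
else:
    print("No CUDA GPU detected")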

For Ultra Hybrid Models

Same GPU requirements as FP8, but significantly lower VRAM needs (a configuration sketch follows below):

  • Granite-34B Ultra Hybrid: 24GB VRAM (vs 35GB for FP8)
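
When VRAM is this tight, vLLM's standard memory settings help keep the weights plus KV cache inside the budget. A minimal sketch; the values below are illustrative starting points, not tuned recommendations:

from vllm import LLM

llm = LLM(
    "TevunahAi/granite-34b-code-instruct-8k-Ultra-Hybrid",
    max_model_len=8192,           # shorter context -> smaller KV cache
    gpu_memory_utilization=0.95,  # fraction of VRAM vLLM may claim
)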

Software:

  • vLLM 0.11+
  • PyTorch 2.1+
  • CUDA 11.8+

Installation:

pip install vllm

Who Should Use Our Models?

FP8 Models Are Ideal For:

  • ✅ Production deployments needing universal compatibility
  • ✅ Standard transformers workflows
  • ✅ Users with Ada Lovelace or Hopper GPUs
  • ✅ Applications requiring proven stability
  • ✅ Teams preferring established tooling

Ultra Hybrid Models Are Ideal For:

  • ✅ Fitting larger models on smaller GPUs (34B on 24GB!)
  • ✅ Maximizing GPU utilization in multi-model setups
  • ✅ Cost-sensitive deployments (smaller GPUs = lower hardware costs)
  • ✅ Users comfortable with vLLM (industry-standard for high-performance inference)
  • ✅ Cutting-edge deployments wanting maximum compression
  • ✅ Teams willing to adopt newer inference technologies

Quality Guarantee

Every quantization we release:

  • ✅ Professional calibration: 2048+ samples (8x industry standard)
  • ✅ Quality verified: Tested for coherence, accuracy, and task performance
  • ✅ Enterprise hardware: Quantized on professional-grade infrastructure
  • ✅ Complete documentation: Usage examples, specifications, and performance metrics
  • ✅ License preservation: Inherits and respects the original model's license
  • ✅ Industry-standard tools: Neural Magic llmcompressor, vLLM, Transformers
  • ✅ Reproducible process: Documented calibration datasets and methodology

We don't just quantize models—we validate them for real-world use.

Why "TevunahAi"?

In Hebrew, תְּבוּנָה (Tevunah) means deep understanding—not just knowledge, but wisdom integrated and applied throughout life. It's about taking complexity and finding clarity, about seeing the essence of things.

We chose this name because quantization is fundamentally about understanding what makes a model work and preserving that essence in a more efficient form.

Just as Tevunah represents turning chaos into order, our quantizations turn massive models into efficient tools without losing their fundamental capabilities.

The name reminds us: compression without understanding is just data loss. Compression with understanding is optimization.


Connect

  • 🤗 HuggingFace: TevunahAi
  • 📦 Browse Models: View Collection
  • 💬 Discussions: We're open to feedback, requests, and collaboration!
  • 📧 Contact: For enterprise inquiries and custom quantization services

Acknowledgments

We're grateful to:

  • Neural Magic - For llmcompressor, the foundation of our quantization pipeline
  • Microsoft, IBM, Qwen Team, OpenAI - For creating exceptional base models and releasing them openly
  • vLLM Team - For enabling cutting-edge mixed-precision inference and pushing the boundaries
  • NVIDIA - For building hardware that makes efficient AI practical and accessible
  • The open-source AI community - For collaboration, feedback, and driving innovation forward, and for giving individuals a place to engage openly, contribute, learn, and give back
