view article Article Welcome Gemma 4: Frontier multimodal intelligence on device +5 13 days ago • 841
view article Article GGML and llama.cpp join HF to ensure the long-term progress of Local AI +4 Feb 20 • 501
Efficient Memory Management for Large Language Model Serving with PagedAttention Paper • 2309.06180 • Published Sep 12, 2023 • 51
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion Paper • 2503.11576 • Published Mar 14, 2025 • 156
view article Article Performant local mixture-of-experts CPU inference with GPU acceleration in llama.cpp Jan 30 • 17
view article Article LightOnOCR-2-1B: a lightweight high-performance end-to-end OCR model family Jan 19 • 91
view article Article Tokenization in Transformers v5: Simpler, Clearer, and More Modular +4 Dec 18, 2025 • 124
view article Article Shrinking Giants: The Quantization Mathematics Making LLMs Accessible May 3, 2025 • 2
view article Article A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using transformers, accelerate and bitsandbytes Aug 17, 2022 • 129