After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥
Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running well below peak efficiency, the problem isn't your model. It's most likely a misuse of the hardware. 🛠️
Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?
That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.
We validated real vs. theoretical bandwidth across the entire stack: HBM3, NVLink 4.0, and PCIe Gen5. Then we ran collective operations across 128 GPUs (16 nodes, 8x H100s each) and measured how performance degrades at scale: all-reduce bandwidth drops noticeably going from a single node to 16 nodes.
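If you want to run this kind of measurement on your own cluster, here is a minimal sketch of an all-reduce bus-bandwidth test using PyTorch's NCCL backend. This is not the playbook's own benchmark code; the tensor size, iteration count, and torchrun launch line are illustrative assumptions.

```python
# Minimal all-reduce bus-bandwidth sketch (illustrative, not the playbook's benchmark).
# Launch with torchrun, e.g.:
#   torchrun --nnodes=16 --nproc_per_node=8 allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist


def allreduce_busbw(size_mb: int = 1024, iters: int = 20) -> float:
    """Measure all-reduce bus bandwidth in GB/s for one fixed message size."""
    world_size = dist.get_world_size()
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))

    # One fp16 tensor of roughly `size_mb` megabytes per rank.
    numel = size_mb * 1024 * 1024 // 2
    tensor = torch.randn(numel, dtype=torch.float16, device=device)

    # Warm-up so NCCL can set up communicators and pick its algorithm.
    for _ in range(5):
        dist.all_reduce(tensor)
    torch.cuda.synchronize(device)

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize(device)
    elapsed = (time.perf_counter() - start) / iters

    # Bus bandwidth for ring all-reduce: 2 * (n - 1) / n * bytes / time.
    bytes_moved = tensor.numel() * tensor.element_size()
    busbw = 2 * (world_size - 1) / world_size * bytes_moved / elapsed
    return busbw / 1e9


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    bw = allreduce_busbw()
    if dist.get_rank() == 0:
        print(f"world_size={dist.get_world_size()} all-reduce busbw: {bw:.1f} GB/s")
    dist.destroy_process_group()
```

Running the same script on 1 node and then on 16 nodes is enough to see the single-node vs. multi-node gap for yourself.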
If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.
The Smol Training Playbook: https://lnkd.in/e5MKXUHS
Shared with ❤️ by the Hugging Face team