
Ujjwal Tyagi

Ujjwal-Tyagi

AI & ML interests

Chief Scientist at Shirova AI, focused on advancing open-source AI. Experienced in LLM fine-tuning, model architecture, and research, with a strong interest in building scalable and efficient models.

Recent Activity

liked a dataset about 1 hour ago
Jackrong/GLM-5.1-Reasoning-1M-Cleaned
replied to anakin87's post about 6 hours ago
How does LLM training with RL environments work?

It all starts with š—„š—²š—¶š—»š—³š—¼š—æš—°š—²š—ŗš—²š—»š˜ š—Ÿš—²š—®š—æš—»š—¶š—»š—“ š˜„š—¶š˜š—µ š—©š—²š—æš—¶š—³š—¶š—®š—Æš—¹š—² š—„š—²š˜„š—®š—æš—±š˜€:
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training

In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s).

Consider a more complex tic-tac-toe env āŒā­•. It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions (envs can also include tools)

---

What happens at training? We use š—šš—æš—¼š˜‚š—½ š—„š—²š—¹š—®š˜š—¶š˜ƒš—² š—£š—¼š—¹š—¶š—°š˜† š—¢š—½š˜š—¶š—ŗš—¶š˜‡š—®š˜š—¶š—¼š—» with the tic-tac-toe env. No critic model is needed: the group is the baseline. Simpler than PPO.

1ļøāƒ£ Rollout generation: from the same board, the model plays N games via sampling
2ļøāƒ£ Each game is scored with deterministic rewards (win, format, ...)
3ļøāƒ£ The mean score is computed across the group
4ļøāƒ£ Each rollout's advantage = its score minus the group mean
5ļøāƒ£ The model is updated to favor trajectories above the baseline
šŸ” Repeat

For a deep dive, check out 🌱 https://github.com/anakin87/llm-rl-environments-lil-course, a free hands-on course on RL environments for LLMs.
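The verifiable-reward loop in the first part of the post fits in a few lines. Here is a minimal Python sketch; the "####" answer-marker convention is an assumption for illustration, not taken from the post or the linked course:

```python
# Minimal sketch of reinforcement learning with verifiable rewards.
# The "#### <answer>" extraction convention is an assumption, not
# something specified in the post or the linked course.

def extract_answer(output: str) -> str:
    # Take whatever follows the final "####" marker as the answer.
    return output.rsplit("####", 1)[-1].strip()

def verifiable_reward(output: str, ground_truth: str) -> float:
    # Reward 1.0 if the extracted answer matches ground truth, else 0.0.
    return 1.0 if extract_answer(output) == ground_truth else 0.0

# Example: model reasoning ends with "#### 42", ground truth is "42".
print(verifiable_reward("2 * 21 = 42\n#### 42", "42"))  # 1.0
```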
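The GRPO steps 1ļøāƒ£ through 5ļøāƒ£ can likewise be sketched directly. In this hypothetical Python illustration, play_game, reward, and the specific reward values are placeholders of mine, not code from the linked course; in the real setup the LLM produces each rollout by sampling moves:

```python
# Minimal sketch of the GRPO step above: group-mean baseline, no critic.
import random

N = 8  # rollouts sampled from the same starting board

def play_game(board: str) -> str:
    # Placeholder rollout: returns a game outcome at random.
    return random.choice(["win", "draw", "loss"])

def reward(outcome: str) -> float:
    # Deterministic reward (a format bonus could be added here).
    return {"win": 1.0, "draw": 0.5, "loss": 0.0}[outcome]

board = "empty 3x3 board"
scores = [reward(play_game(board)) for _ in range(N)]  # steps 1-2
baseline = sum(scores) / len(scores)                   # step 3
advantages = [s - baseline for s in scores]            # step 4
# Step 5: the policy update weights each trajectory by its advantage,
# reinforcing rollouts that scored above the group mean.
print(advantages)
```

Because the baseline is just the mean of the sampled group, no separate critic network has to be trained, which is what makes this setup simpler than PPO.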
reacted to anakin87's post with ā¤ļø about 6 hours ago

Organizations

AI FILMS, GEM benchmark, MusicAI, Open-Source AI Meetup, Chinese-Vicuna, East China Normal University, Keras Dreambooth Event, Stable Diffusion Dreambooth Concepts Library, Binghamton University, Blog-explorers, huggingPartyParis, LocalLLaMA, MLX Community, ONNX Community, Hugging Face Discord Community, LeRobot Worldwide Hackathon, Hugging Face MCP Course, Robotics Course, Hugging Science, Shirova AI, MCP-1st-Birthday