
Ujjwal Tyagi

Ujjwal-Tyagi

AI & ML interests

Chief Scientist at Shirova AI, focused on advancing open-source AI. Experienced in LLM fine-tuning, model architecture, and research, with a strong interest in building scalable and efficient models.

Recent Activity

liked a dataset about 1 hour ago
Jackrong/GLM-5.1-Reasoning-1M-Cleaned
replied to anakin87's post about 6 hours ago
How does LLM training with RL environments work?

It all starts with š—„š—²š—¶š—»š—³š—¼š—æš—°š—²š—ŗš—²š—»š˜ š—Ÿš—²š—®š—æš—»š—¶š—»š—“ š˜„š—¶š˜š—µ š—©š—²š—æš—¶š—³š—¶š—®š—Æš—¹š—² š—„š—²š˜„š—®š—æš—±š˜€:
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training

In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s).

Consider a more complex tic-tac-toe env āŒā­•. It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions (envs can also include tools)

---

What happens at training? We use š—šš—æš—¼š˜‚š—½ š—„š—²š—¹š—®š˜š—¶š˜ƒš—² š—£š—¼š—¹š—¶š—°š˜† š—¢š—½š˜š—¶š—ŗš—¶š˜‡š—®š˜š—¶š—¼š—» with the tic-tac-toe env. No critic model is needed: the group is the baseline. Simpler than PPO.

1ļøāƒ£ Rollout generation: from the same board, the model plays N games via sampling
2ļøāƒ£ Each game is scored with deterministic rewards (win, format, ...)
3ļøāƒ£ The mean score is computed across the group
4ļøāƒ£ Each rollout's advantage = its score minus the group mean
5ļøāƒ£ The model is updated to favor trajectories above the baseline
šŸ” Repeat

For a deep dive, check out 🌱 https://github.com/anakin87/llm-rl-environments-lil-course, a free hands-on course on RL environments for LLMs.
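The verifiable-reward loop in the first part of the post fits in a few lines. Here is a minimal Python sketch; the "####" answer-marker convention is an assumption for illustration, not taken from the post or the linked course:

```python
# Minimal sketch of reinforcement learning with verifiable rewards.
# The "#### <answer>" extraction convention is an assumption, not
# something specified in the post or the linked course.

def extract_answer(output: str) -> str:
    # Take whatever follows the final "####" marker as the answer.
    return output.rsplit("####", 1)[-1].strip()

def verifiable_reward(output: str, ground_truth: str) -> float:
    # Reward 1.0 if the extracted answer matches ground truth, else 0.0.
    return 1.0 if extract_answer(output) == ground_truth else 0.0

# Example: model reasoning ends with "#### 42", ground truth is "42".
print(verifiable_reward("2 * 21 = 42\n#### 42", "42"))  # 1.0
```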
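The GRPO steps 1ļøāƒ£ through 5ļøāƒ£ can likewise be sketched directly. In this hypothetical Python illustration, play_game, reward, and the specific reward values are placeholders of mine, not code from the linked course; in the real setup the LLM produces each rollout by sampling moves:

```python
# Minimal sketch of the GRPO step above: group-mean baseline, no critic.
import random

N = 8  # rollouts sampled from the same starting board

def play_game(board: str) -> str:
    # Placeholder rollout: returns a game outcome at random.
    return random.choice(["win", "draw", "loss"])

def reward(outcome: str) -> float:
    # Deterministic reward (a format bonus could be added here).
    return {"win": 1.0, "draw": 0.5, "loss": 0.0}[outcome]

board = "empty 3x3 board"
scores = [reward(play_game(board)) for _ in range(N)]  # steps 1-2
baseline = sum(scores) / len(scores)                   # step 3
advantages = [s - baseline for s in scores]            # step 4
# Step 5: the policy update weights each trajectory by its advantage,
# reinforcing rollouts that scored above the group mean.
print(advantages)
```

Because the baseline is just the mean of the sampled group, no separate critic network has to be trained, which is what makes this setup simpler than PPO.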
reacted to anakin87's post with ā¤ļø about 6 hours ago

Organizations

AI FILMS, GEM benchmark, MusicAI, Open-Source AI Meetup, Chinese-Vicuna, East China Normal University, Keras Dreambooth Event, Stable Diffusion Dreambooth Concepts Library, Binghamton University, Blog-explorers, huggingPartyParis, LocalLLaMA, MLX Community, ONNX Community, Hugging Face Discord Community, LeRobot Worldwide Hackathon, Hugging Face MCP Course, Robotics Course, Hugging Science, Shirova AI, MCP-1st-Birthday