Post training - a bh9052 Collection

updated about 21 hours ago

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Paper • 2603.19220 • Published Mar 19 • 69
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Paper • 2605.20164 • Published 26 days ago • 6
GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Paper • 2605.19577 • Published 26 days ago • 58
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Paper • 2605.18703 • Published 27 days ago • 50
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Paper • 2605.08472 • Published May 8 • 5
Self-Distilled Agentic Reinforcement Learning

Paper • 2605.15155 • Published about 1 month ago • 112
Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

Paper • 2605.14392 • Published about 1 month ago • 8
RewardHarness: Self-Evolving Agentic Post-Training

Paper • 2605.08703 • Published May 9 • 10
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Paper • 2605.10899 • Published May 11 • 79
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Paper • 2605.10832 • Published May 11 • 22
Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Paper • 2605.12070 • Published May 12 • 16
Reward Hacking in Rubric-Based Reinforcement Learning

Paper • 2605.12474 • Published May 12 • 5
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

Paper • 2605.11518 • Published May 12 • 4
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Paper • 2605.09269 • Published May 10 • 6
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

Paper • 2605.10488 • Published May 11 • 3
Flow-OPD: On-Policy Distillation for Flow Matching Models

Paper • 2605.08063 • Published May 8 • 101
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Paper • 2605.04036 • Published May 5 • 69
QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks

Paper • 2605.24218 • Published 23 days ago • 42
Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

Paper • 2605.26086 • Published 20 days ago • 24
RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

Paper • 2605.29156 • Published 18 days ago • 14
Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Paper • 2605.29648 • Published 17 days ago • 10
LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Paper • 2605.31584 • Published 16 days ago • 41
GrepSeek: Training Search Agents for Direct Corpus Interaction

Paper • 2605.29307 • Published 17 days ago • 106
SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Paper • 2605.29796 • Published 17 days ago • 25
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Paper • 2606.02031 • Published 13 days ago • 20
When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Paper • 2605.24202 • Published 23 days ago • 17
PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training

Paper • 2606.03264 • Published 12 days ago • 16
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Paper • 2606.04923 • Published 11 days ago • 37
Reinforcement Learning from Rich Feedback with Distributional DAgger

Paper • 2606.05152 • Published 11 days ago • 3
Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

Paper • 2606.05988 • Published 10 days ago • 2
Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Paper • 2606.00590 • Published 15 days ago
SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Paper • 2606.09730 • Published 6 days ago • 49
FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

Paper • 2606.12087 • Published 4 days ago • 71
EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Paper • 2606.13120 • Published 3 days ago • 4