BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
Abstract
Band-constrained Policy Optimization addresses stability issues in reinforcement learning for large language models by replacing fixed clipping with a dynamic probability-aware projection method that prevents entropy collapse.
Proximal constraints are fundamental to the stability of reinforcement learning for large language models. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
Community
This paper introduces BandPO (Band-constrained Policy Optimization), which addresses a critical but often overlooked bottleneck in LLM Reinforcement Learning (such as PPO/GRPO/DAPO).
Why it matters:
The canonical clipping mechanism in PPO/GRPO/DAPO uses fixed bounds. The authors mathematically reveal that this strictly constrains the upward update margin of low-probability actions, which disproportionately suppresses high-advantage tail strategies and induces rapid entropy collapse. Simply relaxing the bounds (like Clip-Higher) leads to training instability.
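To make the bottleneck concrete, here is a tiny, hypothetical numerical illustration (not from the paper): with a fixed clip range $\epsilon$, the importance ratio $r = \pi_\text{new}/\pi_\text{old}$ is bounded in $[1-\epsilon, 1+\epsilon]$, so the absolute probability a token can gain in one update is capped at $p_\text{old} \cdot \epsilon$, no matter how large its advantage.

```python
# Hypothetical illustration (not the paper's code): under fixed PPO-style
# clipping, the largest absolute probability increase the clip allows for a
# token is p_old * eps -- vanishingly small for rare tokens.

def max_upward_gain(p_old: float, eps: float = 0.2) -> float:
    """Cap on one-step probability gain implied by r <= 1 + eps."""
    return p_old * eps

common = max_upward_gain(0.5)   # 0.1: ample headroom for a common token
rare = max_upward_gain(1e-4)    # ~2e-5: a rare high-advantage token is near-frozen
```

This is why simply raising the upper bound (Clip-Higher) helps rare tokens but loosens the constraint for common tokens at the same time, trading exploration for stability.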
Key Contributions:
- Dynamic, Probability-Aware Bounds: BandPO replaces fixed clipping with a unified "Band" operator, projecting trust regions defined by $f$-divergences into dynamic clipping intervals.
- Prevents Entropy Collapse: It naturally expands the feasible upward margin for low-probability actions to prevent premature clipping, effectively preserving critical exploration gradients without losing stability.
- Strong Empirical Results: Built on top of the GRPO framework, BandPO consistently outperforms vanilla GRPO and Clip-Higher on mathematical reasoning benchmarks (AMC 2023, AIME 2024/2025) across diverse models including Qwen2.5 (3B, 7B) and DeepSeek-R1-Distill (Llama-8B, Qwen).
We believe this provides a highly effective and theoretically grounded improvement over standard GRPO clipping, which will be very valuable to the open-source LLM post-training community. Code is publicly available!
I'm just starting to read it. But I have a question: is this the same discovery as in DPPO?
https://huggingface.co/papers/2602.04879
The basic RL approach to LLMs requires rethinking: LLMs compute probabilities over vocabularies of tens of thousands of tokens, an action space many times larger than the settings RL has traditionally been applied to.
me too
wait, the idea of per-action, probability-aware clipping intervals that adapt to the old policy instead of fixed bounds is a slick way to keep tail actions in play. i'm curious how robust the convex optimization stays when you scale to huge vocabularies and try different f-divergences, especially with real-world rlhf noise. the breakdown on arxivlens was solid, found a nice walkthrough here: https://arxivlens.com/PaperView/Details/bandpo-bridging-trust-regions-and-ratio-clipping-via-probability-aware-bounds-for-llm-reinforcement-learning-283-62d2c3b7
Thanks for the thoughtful comment, and thanks for sharing the walkthrough.
This is a very natural concern, but one key point is that the runtime computation in BandPO is not a full high-dimensional optimization over the vocabulary. In our formulation, the trust-region projection can be strictly reduced to a 1D problem parameterized only by the old probability of the target token. The high-dimensional simplex constraint is scalarized into a univariate equation $g_f(p, r) = \delta$, whose roots give the clipping bounds directly.
So the practical scaling behavior is much lighter than “solving a huge convex program over a 100K+ vocabulary” might suggest. For TV and Pearson $\chi^2$, the bounds are available in closed form; for KL, the active-regime bound is the unique root of a monotone binding equation, which can be solved efficiently with standard bracketed methods such as bisection or Brent’s method, with global convergence guarantees.
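As a sketch of what a closed-form band can look like: assuming the TV constraint scalarizes to $|p r - p| \le \delta$ (our reading of the reply above, not necessarily the paper's exact $g_f$), the admissible ratio interval follows directly and widens for rare tokens:

```python
# Hypothetical TV-style band (assumed scalarized constraint |p*r - p| <= delta;
# an illustration, not necessarily the paper's exact formulation).

def tv_band(p_old: float, delta: float = 0.05) -> tuple[float, float]:
    """Probability-aware clipping interval for the ratio r = pi_new / pi_old."""
    lo = max(0.0, 1.0 - delta / p_old)
    hi = min(1.0 / p_old, 1.0 + delta / p_old)  # p_old * r can never exceed 1
    return lo, hi

# Fixed clipping gives every token the same interval, e.g. [0.8, 1.2]; the
# TV band stays tight for common tokens but opens up for rare ones.
print(tv_band(0.5))    # (0.9, 1.1)
print(tv_band(0.01))   # (0.0, 6.0)
```

Note how the same budget $\delta$ yields a tight interval at $p = 0.5$ but a sixfold upward margin at $p = 0.01$, which is exactly the "feasible upward margin for low-probability actions" behavior described above.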
In our implementation, we use a CUDA-parallelized bisection solver for the KL case, so the additional overhead is practical and parallel-friendly in LLM RL training.
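A minimal CPU sketch of the bracketed solver described above, using plain bisection on a monotone binding function. The function `g` here is an illustrative KL-like placeholder, not the paper's exact $g_f$:

```python
# Bracketed bisection for a monotone binding equation g(r) = delta.
# The specific g below is only an illustrative KL-like placeholder.
import math

def solve_band_bound(g, delta: float, lo: float, hi: float, iters: int = 80) -> float:
    """Find r in [lo, hi] with g(r) = delta, assuming g is increasing on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Placeholder binding term p * (r*log r - r + 1), which is zero at r = 1 and
# strictly increasing for r > 1, so the upper bound is its unique root at delta.
p = 0.01
g = lambda r: p * (r * math.log(r) - r + 1.0)
r_upper = solve_band_bound(g, delta=0.05, lo=1.0, hi=1e4)
```

Since each token's bound is an independent 1D root-find like this, batching them across a GPU (as in the CUDA solver mentioned above) is straightforward.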
Also, regarding “real-world RLHF noise”: that kind of noise can certainly affect the overall optimization dynamics, but it does not directly make the Band bound computation ill-posed. The bound solver itself is a deterministic geometric mapping from $(p, \delta, f)$ to the admissible interval, rather than a noisy inner optimization over rewards. In that sense, the numerical stability issue is much milder than it may initially sound. The broader point of BandPO is exactly to replace fixed clipping with theoretically valid, probability-aware bounds while preserving practical usability.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Rethinking the Trust Region in LLM Reinforcement Learning (2026)
- MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning (2026)
- A Unified Framework for Rethinking Policy Divergence Measures in GRPO (2026)
- QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning (2026)
- TRE: Encouraging Exploration in the Trust Region (2026)
- Clipping-Free Policy Optimization for Large Language Models (2026)
- Flexible Entropy Control in RLVR with Gradient-Preserving Perspective (2026)