PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization
Ben Rahman
- Year
- 2025
- Access
- Open access
Abstract
Despite Proximal Policy Optimization (PPO) dominating policy gradient methods -- from robotic control to game AI -- its static trust region forces a brittle trade-off: aggressive clipping stifles early exploration, while late-stage updates destabilize convergence. PPO-BR establishes a new paradigm in adaptive RL by fusing exploration and convergence signals into a single bounded trust region -- a theoretically grounded innovation that outperforms five SOTA baselines with less than 2% overhead. This work bridges a critical gap in phase-aware learning, enabling real-world deployment in safety-critical systems like robotic surgery within a single adaptive mechanism. PPO-BR achieves 29.1% faster convergence by combining: (1) entropy-driven expansion (epsilon up) for exploration in high-uncertainty states, and (2) reward-guided contraction (epsilon down) for convergence stability. On six diverse benchmarks (MuJoCo, Atari, sparse-reward), PPO-BR achieves 29.1% faster convergence (p < 0.001), 2.3x lower reward variance than PPO, and less than 1.8% runtime overhead with only five lines of code change. PPO-BR's simplicity and theoretical guarantees make it ready-to-deploy in safety-critical domains -- from surgical robotics to autonomous drones. In contrast to recent methods such as Group Relative Policy Optimization (GRPO), PPO-BR offers a unified entropy-reward mechanism applicable to both language models and general reinforcement learning environments.
Keywords
Related papers
Campbell-Walsh urology
Alan J. Wein editor-in-chief
2012
Principles of Robot Motion: Theory, Algorithms, and Implementations
Howie Choset, Jean‐Claude Latombe
2005
Minimally Invasive versus Abdominal Radical Hysterectomy for Cervical Cancer
Pedro T. Ramírez, Michael Frumovitz, René Pareja +16 more
2018
Guideline for Management of the Clinical T1 Renal Mass
Steven C. Campbell, Andrew C. Novick, Arie S. Belldegrun +9 more
2009