RL Token: Bootstrapping Online RL with Vision-Language-Action Models
Charles Xu, Jost Tobias Springenberg, Michael Equi, Ali Amin, Adnan Esmail, Sergey Levine, Liyiming Ke
- Year
- 2026
- Access
- Open access
Abstract
Vision-language-action (VLA) models can learn to perform diverse manipulation skills "out of the box," but achieving the precision and speed that real-world tasks demand requires further fine-tuning -- for example, via reinforcement learning (RL). We introduce a lightweight method that enables sample-efficient online RL fine-tuning of pretrained VLAs using just a few hours of real-world practice. We (1) adapt the VLA to expose an "RL token," a compact readout representation that preserves task-relevant pretrained knowledge while serving as an efficient interface for online RL, and (2) train a small actor-critic head on this RL token to refine the actions, while anchoring the learned policy to the VLA. Online RL with the RL token (RLT) makes it possible to fine-tune even large VLAs with RL quickly and efficiently. Across four real-robot tasks (screw installation, zip tie fastening, charger insertion, and Ethernet insertion), RLT improves the speed on the hardest part of the task by up to 3x and raises success rates significantly within minutes to a few hours of practice. It can even surpass the speed of human teleoperation on some of the tasks.
Keywords
Related papers
The Uncanny Valley [From the Field]
Masahiro Mori, Karl F. MacDorman, Norri Kageki
2012
Measurement Instruments for the Anthropomorphism, Animacy, Likeability, Perceived Intelligence, and Perceived Safety of Robots
Christoph Bartneck, Dana Kulić, Elizabeth A. Croft +1 more
2008
The development of Honda humanoid robot
Kazuo Hirai, Masato Hirose, Y. Haikawa +1 more
2002
A Meta-Analysis of Factors Affecting Trust in Human-Robot Interaction
Peter A. Hancock, Deborah R. Billings, Kristin E. Schaefer +3 more
2011