HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model
Xiang Zhu, Puzhen Yuan, Yichen Liu, Jianyu Chen
- 发表年份
- 2026
- 访问权限
- 开放获取
摘要
Learning generalizable vision-language-action (VLA) models from large-scale human videos is promising but challenging due to cross-embodiment discrepancies in both visual observations and executable actions. While latent action models reduce the action execution gap by learning action abstractions, they still rely on visual features. Thus, misaligned human and robot visual representations can lead to inconsistencies in policy inputs and induce domain-dependent latent actions, hindering effective co-training with human videos. To address this, we propose HARP, a human-robot aligned representation learning framework for more effective VLA pretraining from human videos. Specifically, HARP uses limited paired human-robot demonstrations as cross-embodiment bridges and abundant unpaired human and robot videos as a scalable dynamics supervision data source. It trains a robot-adapted visual encoder and a latent action model with manipulation-centric auxiliary cues and a source-relative pair-discriminative alignment loss, which adapts robot representations toward human semantics while preserving pair-level discrimination. The learned aligned vision encoder and latent action model provide a unified vision and action representation for VLA-style policy learning, where human and robot videos provide vision-language-to-latent-action supervision and a lightweight robot action head grounds latent actions into executable commands. Experiments on feature visualization, simulation, and realworld manipulation show improved human-robot alignment and downstream policy performance, achieving 4.481 average length on CALVIN ABC$\rightarrow$D and a 7.1\% realworld success rate gain over the strongest baseline.
关键词
相关论文
The Uncanny Valley [From the Field]
Masahiro Mori, Karl F. MacDorman, Norri Kageki
2012
Measurement Instruments for the Anthropomorphism, Animacy, Likeability, Perceived Intelligence, and Perceived Safety of Robots
Christoph Bartneck, Dana Kulić, Elizabeth A. Croft 等 4 位作者
2008
The development of Honda humanoid robot
Kazuo Hirai, Masato Hirose, Y. Haikawa 等 4 位作者
2002
A Meta-Analysis of Factors Affecting Trust in Human-Robot Interaction
Peter A. Hancock, Deborah R. Billings, Kristin E. Schaefer 等 6 位作者
2011