Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment
Theodor Wulff, Federico Tavella, Rahul Singh Maharjan, Manith Adikari, Angelo Cangelosi
- Year: 2026
- Access: Open access
Abstract
Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate language (e.g., through chain-of-thought) and low-level actions. However, current work does not consider explicit alignment between these modalities during training. To address this crucial gap, we propose a novel training framework that explicitly grounds hierarchical VLA sub-task descriptions with respect to the visual observation and action space. Our framework uses a contrastive model to assess the alignment between generated language and corresponding action trajectories. This contrastive model enables direct ranking of different language-trajectory pairs based on their alignment, allowing us to refine the grounding of our hierarchical VLA through offline preference learning. We apply our framework to the LanguageTable dataset, a benchmark dataset of human language-annotated trajectories, and provide critical insights into multimodal grounding representations, all while establishing a strong baseline that achieves performance comparable to fully supervised fine-tuning and minimizing the need for costly data annotations.
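The ranking step described in the abstract can be illustrated with a minimal sketch. It assumes the contrastive model yields one embedding per sub-task description and one per action trajectory, and that alignment is scored by cosine similarity; the abstract does not state the scoring function. The function names, the toy 2-D embeddings, and the best-versus-worst pairing rule are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(
    traj_emb: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Score each (text, embedding) candidate against the trajectory embedding
    and return (text, score) pairs sorted from most to least aligned."""
    scored = [(text, cosine(traj_emb, emb)) for text, emb in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def preference_pair(ranked: List[Tuple[str, float]]) -> Tuple[str, str]:
    """Assumed pairing rule: the best-aligned description is 'chosen' and the
    worst-aligned is 'rejected'. Other pairing schemes are equally possible."""
    if len(ranked) < 2:
        raise ValueError("need at least two candidates to form a preference pair")
    return ranked[0][0], ranked[-1][0]


# Toy, hand-written embeddings standing in for contrastive-model outputs.
trajectory_embedding = [1.0, 0.0]
candidate_descriptions = [
    ("push the red block left", [0.9, 0.1]),
    ("push the blue block right", [0.1, 0.9]),
    ("move toward the red block", [0.6, 0.4]),
]

ranked = rank_candidates(trajectory_embedding, candidate_descriptions)
chosen, rejected = preference_pair(ranked)
```

Pairs of this form (a preferred and a dispreferred description for the same trajectory) are the kind of input an offline preference-learning objective consumes; which objective the authors use is not specified in the abstract.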