VST-LLM HRI: Multimodal Human-Robot Interaction via Large Language Model Prompts

Weikai Ding, Shijun Xiao, Zhengguo Zhu, Teng Chen, Guoteng Zhang

Year
2025
Citations
2

Abstract

This paper proposes a Visual-Speech-Text Large Language Model framework for Human-Robot Interaction (VST-LLM HRI). By designing a Modality Language Model (MLM), the framework closes the loop between robot perception, task planning, and control. Without fine-tuning the Large Language Model (LLM), the framework relies on visual semantic extraction, speech command conversion, and prompt engineering to accomplish tasks. We conducted experiments on a bipedal robot to validate the adaptability and control performance of the framework in complex-terrain task scenarios. The experimental results indicate that the proposed method generalizes well. The related project files and programs have been uploaded to https://github.com/dwk-Suga/LLMandVLM.git.
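
As a rough illustration of the idea in the abstract (visual semantics and a transcribed speech command merged into a text prompt for an LLM that is not fine-tuned), the following minimal sketch may help. It is not taken from the paper or its repository: the `Observation` class, the `ALLOWED_SKILLS` names, and the prompt wording are all hypothetical, and the perception, speech-to-text, and LLM calls are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical container for one perception step (not from the paper)."""
    scene_labels: list[str]   # text from some visual semantic extractor
    spoken_command: str       # text from some speech-to-text step


# Hypothetical skill names; the paper's actual action set is not stated here.
ALLOWED_SKILLS = ("walk_forward", "turn_left", "turn_right", "step_up", "stop")


def build_prompt(obs: Observation) -> str:
    """Merge visual and speech text into one prompt for an unmodified LLM."""
    scene = ", ".join(obs.scene_labels) if obs.scene_labels else "nothing detected"
    skills = ", ".join(ALLOWED_SKILLS)
    return (
        "You are the task planner for a bipedal robot.\n"
        f"Visible scene: {scene}.\n"
        f'Operator said: "{obs.spoken_command.strip()}".\n'
        f"Reply with a comma-separated plan using only these skills: {skills}."
    )


def parse_plan(reply: str) -> list[str]:
    """Keep only the skills the controller knows; drop anything else."""
    steps = [s.strip() for s in reply.split(",")]
    return [s for s in steps if s in ALLOWED_SKILLS]
```

In this sketch the LLM reply is filtered against a fixed skill list before anything reaches the controller, so an unexpected token in the reply is dropped instead of being executed. Whether the authors use a comparable safeguard is not stated in the abstract.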

Keywords

Task (project management); Robot; Modality (human–computer interaction); Generalization; Task analysis; Language understanding; Adaptability; Terrain; Natural language
