VST-LLM HRI: Multimodal Human-Robot Interaction via Large Language Model Prompts
Weikai Ding, Shijun Xiao, Zhengguo Zhu, Teng Chen, Guoteng Zhang
- Year: 2025
- Citations: 2
Abstract
This paper proposes a Visual-Speech-Text Large Language Model framework for Human-Robot Interaction (VST-LLM HRI). By designing a Modality Language Model (MLM), the framework closes the loop between robot perception, task planning, and control. It requires no fine-tuning of the Large Language Model (LLM); instead, it relies on visual semantic extraction, speech command conversion, and prompt engineering to guide the model through tasks. We validated the framework's adaptability and control performance on a bipedal robot in task scenarios over complex terrain. The experimental results show that the proposed method generalizes well. The related project files and programs are available at https://github.com/dwk-Suga/LLMandVLM.git.
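To make the prompt-driven pipeline concrete, the following is a minimal sketch of how a scene description and a transcribed spoken command might be combined into a single planning prompt for an LLM that has not been fine-tuned. It is not the authors' implementation: the skill names, the `Observation` type, and the functions `build_prompt` and `parse_plan` are all hypothetical, and the visual and speech front ends are assumed to have already produced plain text.

```python
from dataclasses import dataclass

# Hypothetical skill names, for illustration only; the abstract does not
# list the robot's actual command set.
ALLOWED_SKILLS = ("walk_forward", "turn_left", "turn_right", "climb_stairs", "stop")


@dataclass
class Observation:
    scene_description: str  # text from a visual semantic extractor (assumed)
    spoken_command: str     # text from a speech-to-text step (assumed)


def build_prompt(obs, skills=ALLOWED_SKILLS):
    """Assemble one text prompt from the visual and speech inputs."""
    skill_list = ", ".join(skills)
    return (
        "You are the task planner for a bipedal robot.\n"
        f"Available skills: {skill_list}\n"
        f"Scene: {obs.scene_description}\n"
        f"User command: {obs.spoken_command}\n"
        "Reply with one skill name per line, in execution order."
    )


def parse_plan(reply, skills=ALLOWED_SKILLS):
    """Keep only lines of the LLM reply that name a known skill."""
    plan = []
    for line in reply.splitlines():
        name = line.strip().lower()
        if name in skills:
            plan.append(name)
    return plan
```

In this sketch, the string returned by `build_prompt` would be sent to whichever LLM is in use, and `parse_plan` would filter the reply so that only recognized skills reach the controller. Restricting the output to a fixed skill list is one simple way to keep an unmodified LLM's answers executable; the paper may handle this differently.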