HRI

I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots

Giulio Antonio Abbo, Tony Belpaeme

Year: 2025
Citations: 7

Abstract

In the rapidly evolving landscape of human-robot interaction, the integration of vision capabilities into conversational agents stands as a crucial advancement. This paper presents a ready-to-use implementation of a dialogue manager that leverages the latest progress in Large Language Models (e.g., GPT-4o mini) to enhance the traditional text-based prompts with real-time visual input. LLMs are used to interpret both textual prompts and visual stimuli, creating a more contextually aware conversational agent. The system's prompt engineering, incorporating dialogue with summarisation of the images, ensures a balance between context preservation and computational efficiency. Six interactions with a Furhat robot powered by this system are reported, illustrating and discussing the results obtained. The system can be customised and is available as a stand-alone application, a Furhat robot implementation, and a ROS2 package.

Keywords

Robot, Computer science, Human–computer interaction, Social robot, Artificial intelligence, Computer vision, Mobile robot, Robot control
