
LOCOMOTION

Quart-Online: Latency-Free Multimodal Large Language Model for Quadruped Robot Learning

Xinyang Tong, Pengxiang Ding, Y. L. Fan, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu

Year: 2025
Citations: 2

Abstract

This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLMs) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference at 50 Hz in sync with the underlying controller frequency and boosting the success rate across various tasks by 65%. Our project page is https://quart-online.github.io.
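
The abstract describes ACD only at a high level, so the following is a minimal sketch of the general idea (mapping chunks of continuous actions to the nearest entry in a small set of representative vectors), not the authors' implementation. It assumes plain nearest-neighbor vector quantization against a fixed codebook; the function names `quantize_chunks` and `decode_chunks`, the chunk length, and the codebook values are all hypothetical, and how the paper actually learns or structures its discrete vectors is not stated here.

```python
import numpy as np


def quantize_chunks(actions, codebook, chunk_len):
    """Map each chunk of continuous actions to the index of its nearest code.

    actions:  array of shape (n_steps, action_dim)
    codebook: array of shape (n_codes, chunk_len * action_dim), hypothetical
    Returns an integer array of shape (n_steps // chunk_len,).
    """
    actions = np.asarray(actions, dtype=float)
    n_steps, action_dim = actions.shape
    n_chunks = n_steps // chunk_len
    # Drop any trailing partial chunk, then flatten each chunk to one vector.
    chunks = actions[: n_chunks * chunk_len].reshape(n_chunks, chunk_len * action_dim)
    # Squared Euclidean distance from every chunk to every codebook vector.
    dists = ((chunks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def decode_chunks(indices, codebook, chunk_len):
    """Expand code indices back into a (n_steps, action_dim) action sequence."""
    action_dim = codebook.shape[1] // chunk_len
    return codebook[indices].reshape(-1, action_dim)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    chunk_len, action_dim, n_codes = 5, 12, 16   # illustrative sizes only
    codebook = rng.normal(size=(n_codes, chunk_len * action_dim))
    actions = rng.normal(size=(50, action_dim))  # stand-in for controller commands
    tokens = quantize_chunks(actions, codebook, chunk_len)
    approx = decode_chunks(tokens, codebook, chunk_len)
    print(tokens.shape, approx.shape)
```

In this sketch one discrete token stands in for `chunk_len` consecutive control steps, which is the sense in which a chunked, discretized action space reduces how many outputs a model must produce per unit of control time.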

Keywords

Computer science, Latency (audio), Robot, Human–computer interaction, Artificial intelligence, Telecommunications
