首页 /研究 /M3PT: A Transformer for Multimodal, Multi-Party Social Signal Prediction with Person-aware Blockwise Attention

HRI

M3PT: A Transformer for Multimodal, Multi-Party Social Signal Prediction with Person-aware Blockwise Attention

Yiming Tang, Abrar Anwar, Jesse Thomason

发表年份: 2025
访问权限: 开放获取

摘要

Understanding social signals in multi-party conversations is important for human-robot interaction and artificial social intelligence. Social signals include body pose, head pose, speech, and context-specific activities like acquiring and taking bites of food when dining. Past work in multi-party interaction tends to build task-specific models for predicting social signals. In this work, we address the challenge of predicting multimodal social signals in multi-party settings in a single model. We introduce M3PT, a causal transformer architecture with modality and temporal blockwise attention masking to simultaneously process multiple social cues across multiple participants and their temporal interactions. We train and evaluate M3PT on the Human-Human Commensality Dataset (HHCD), and demonstrate that using multiple modalities improves bite timing and speaking status prediction. Source code: https://github.com/AbrarAnwar/masked-social-signals/.

关键词

cs.LGcs.AIcs.RO

M3PT: A Transformer for Multimodal, Multi-Party Social Signal Prediction with Person-aware Blockwise Attention

摘要

关键词

相关论文

The Uncanny Valley [From the Field]

Measurement Instruments for the Anthropomorphism, Animacy, Likeability, Perceived Intelligence, and Perceived Safety of Robots

The development of Honda humanoid robot

A Meta-Analysis of Factors Affecting Trust in Human-Robot Interaction