HRI

MIST: Multimodal emotion recognition using DeBERTa for text, Semi-CNN for speech, ResNet-50 for facial, and 3D-CNN for motion analysis

Enguerrand Boitel, Alaa Mohasseb, Ella Haig

Year: 2025
Citations: 32

Abstract

Human emotion recognition is a rapidly evolving field in artificial intelligence, crucial for improving human–computer interaction. This paper introduces the MIST (Motion, Image, Speech, and Text) framework, a novel multimodal approach to emotion recognition that integrates diverse data modalities. Unlike existing models focusing on unimodal analysis, MIST leverages the complementary strengths of text (using DeBERTa), speech (using Semi-CNN), facial (using ResNet-50), and motion (using 3D-CNN) data to enhance accuracy and reliability. Our evaluation, conducted on the BAUM-1 and SAVEE datasets, demonstrates that MIST significantly outperforms traditional unimodal and some multimodal approaches in emotion recognition tasks. This research advances the field by providing a better understanding of emotional states, with potential applications in social robots, personal assistants, and educational technologies.
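
The abstract does not say how the four modality models are combined, so the following is only a minimal sketch of one common option, weighted late fusion of per-modality class probabilities. The label set, the `late_fusion` and `predict` helpers, the weights, and the example probabilities are all hypothetical and are not taken from the paper; in practice the probabilities would come from the DeBERTa, Semi-CNN, ResNet-50, and 3D-CNN branches.

```python
from typing import Dict, List

# Illustrative label set only; the paper's actual classes are not given in the abstract.
EMOTIONS = ["anger", "happiness", "sadness"]


def late_fusion(probs_by_modality: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Return the weighted average of per-modality class probabilities.

    This is an assumed fusion rule for illustration, not the MIST method.
    """
    total = sum(weights[m] for m in probs_by_modality)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = [0.0] * len(EMOTIONS)
    for modality, probs in probs_by_modality.items():
        if len(probs) != len(EMOTIONS):
            raise ValueError(f"{modality}: expected {len(EMOTIONS)} probabilities")
        for i, p in enumerate(probs):
            fused[i] += weights[modality] * p / total
    return fused


def predict(probs_by_modality: Dict[str, List[float]],
            weights: Dict[str, float]) -> str:
    """Pick the emotion with the highest fused probability."""
    fused = late_fusion(probs_by_modality, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best]


# Made-up softmax outputs standing in for the four modality branches.
example_probs = {
    "text": [0.2, 0.7, 0.1],
    "speech": [0.5, 0.3, 0.2],
    "facial": [0.1, 0.8, 0.1],
    "motion": [0.3, 0.4, 0.3],
}
equal_weights = {m: 1.0 for m in example_probs}
print(predict(example_probs, equal_weights))
```

With equal weights, a single modality that disagrees (here, speech) is outvoted by the other three, which is one intuition for why combining modalities can be more reliable than any one of them alone.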

Keywords

Computer science, Speech recognition, Emotion recognition, Residual neural network, Artificial intelligence, Motion (physics), Pattern recognition (psychology), Deep learning
