Personalized Speech Emotion Recognition in Human-Robot Interaction Using Vision Transformers
Ruchik Mishra, Andrew Frye, Madan M. Rayguru, Dan O. Popa
- 发表年份
- 2025
- 引用次数
- 4
摘要
Emotions are an essential element in human verbal communication, therefore it is important to understand individuals' affect during human-robot interaction (HRI). This letter investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (Bidirectional Encoder Representations from Pre-Training of Image Transformers) pipelines for Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models for individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from several human subjects having pseudo-naturalistic conversations with the NAO social robot. We then fine-tuned our ViT and BEiT-based models and tested these models on unseen speech samples from the participants in order to dentify four primary emotions from speech: neutral, happy, sad, and angry. The results show that fine-tuning vision transformers on benchmark datasets and then using either these already fine-tuned models or ensembling ViT/BEiT models results in higher classification accuracies than fine-tuning vanilla-ViTs or BEiTs.
关键词
相关论文
Statistical Learning Theory
Yuhai Wu, Vladimir Vapnik
1999
Artificial intelligence: a modern approach
1995
Applied Nonlinear Control
Jean-Jacques Slotine, Weiping Li
1991
A new optimizer using particle swarm theory
R.C. Eberhart, James Kennedy
2002