SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
Guankun Wang, Junyi Wang, Wenjin Mo, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, Hongliang Ren
- Year: 2025
- Access: Open access
Abstract
Surgical scene understanding is critical for surgical training and robotic decision-making in robot-assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for improving scene perception in the medical domain, helping surgeons understand surgical scenes and procedures. However, these methods are primarily oriented towards image-based analysis or global video understanding, overlooking the fine-grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train SurgVidLM, we construct SVU-31K, a large-scale dataset with over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Building on this resource, SurgVidLM incorporates a two-stage StageFocus mechanism: the first stage extracts global procedural context, while the second stage performs high-frequency local analysis guided by temporal cues. We also develop Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the preservation of critical task-specific details. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing the context of complex robot-assisted surgeries. Our code and dataset will be publicly accessible soon.
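As an illustration only, the two-stage idea described in the abstract (a low-frequency pass over the whole video, then a high-frequency pass over a temporally cued segment) can be sketched as a frame-sampling plan. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `sample_frame_indices` and `two_stage_plan`, the sampling rates (0.5 fps global, 5 fps local), and the representation of the temporal cue as a `(start, end)` window in seconds are all hypothetical, since the abstract does not specify them.

```python
from typing import Dict, List, Tuple


def sample_frame_indices(start_s: float, end_s: float,
                         video_fps: float, sample_fps: float) -> List[int]:
    """Frame indices sampled at `sample_fps` from the interval [start_s, end_s)."""
    if video_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame rates must be positive")
    if end_s <= start_s:
        return []
    # Number of samples that fit in the interval (epsilon guards float error).
    n = int((end_s - start_s) * sample_fps + 1e-9)
    # round() avoids off-by-one errors from float truncation.
    return [round((start_s + k / sample_fps) * video_fps) for k in range(n)]


def two_stage_plan(duration_s: float, video_fps: float,
                   window_s: Tuple[float, float],
                   global_fps: float = 0.5,   # assumed rate, not from the paper
                   local_fps: float = 5.0     # assumed rate, not from the paper
                   ) -> Dict[str, List[int]]:
    """Hypothetical two-stage plan: sparse global frames, dense local frames."""
    # Clamp the temporal cue to the video bounds.
    start = max(0.0, window_s[0])
    end = min(duration_s, window_s[1])
    return {
        # Stage 1: low-frequency sampling over the full procedure.
        "global": sample_frame_indices(0.0, duration_s, video_fps, global_fps),
        # Stage 2: high-frequency sampling inside the cued segment only.
        "local": sample_frame_indices(start, end, video_fps, local_fps),
    }


if __name__ == "__main__":
    plan = two_stage_plan(duration_s=60.0, video_fps=30.0, window_s=(20.0, 24.0))
    print(len(plan["global"]), "global frames;", len(plan["local"]), "local frames")
```

In a full model, the frames from each stage would be encoded into visual tokens and combined by a fusion module; that step is omitted here because the abstract does not describe its internals.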