VLM-VPI: A Vision-Language Reasoning Framework for Improving Automated Vehicle-Pedestrian Interactions
Qingwen Pu, Kun Xie, Yuxiang Liu
- Year: 2026
- Access: Open access
Abstract
Autonomous driving systems often infer pedestrian yielding behavior from geometric and kinematic cues alone, limiting their ability to reason about visual scene context and age-dependent behavioral variability. This limitation can produce delayed interventions in safety-critical encounters and unnecessary braking in benign interactions. This work introduces Vision-Language Model-based Vehicle-Pedestrian Interaction (VLM-VPI), a multimodal reasoning framework for pedestrian intent understanding and yielding-aware control in autonomous driving. The system combines three components: a multimodal perception layer that captures visual and kinematic observations, a reasoning layer that uses Qwen3-VL 8B for visual scene understanding and GPT-OSS 20B for few-shot intent reasoning, and a tiered safety controller that applies age-specific braking margins for children, adults, and seniors. In 112 CARLA scenarios, VLM-VPI achieves 92.3% intent classification accuracy, outperforming a rule-based baseline (78.4%), supervised trajectory models (73.5-82.4%), and a zero-shot LLM configuration (88.4%). Validation on 24 real-world PIE scenarios yields 87.5% accuracy, indicating functional sim-to-real transferability. Across 200 simulation cases, VLM-VPI reduces the false-alarm rate from 7.4% to 2.8% and mean intersection traversal time from 13.5 s to 11.8 s. Conflict occurrences decrease from 124 to 33, while mean minimum time-to-collision improves from 1.92 s to 4.47 s. Demographic-adaptive control further reduces conflicts by 60% for children and 54.5% for seniors compared with uniform control. These results show that an explicit vision-language reasoning layer can improve both safety and efficiency by linking pedestrian intent, demographic context, and vehicle control decisions.
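The tiered, age-aware control idea and the time-to-collision (TTC) metric mentioned in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the constant-velocity TTC formula, the threshold values in `AGE_MARGIN_S`, the tier names, and the function names `time_to_collision` and `control_tier`. The abstract does not state the actual margins or decision logic.

```python
import math

# Hypothetical TTC margins in seconds per age group (illustrative values only;
# the abstract does not report the margins used by VLM-VPI).
AGE_MARGIN_S = {"child": 4.0, "adult": 2.5, "senior": 3.5}


def time_to_collision(gap_m: float, closing_speed_mps: float) -> float:
    """Constant-velocity TTC: gap divided by closing speed.

    Returns infinity when the vehicle and pedestrian are not closing.
    """
    if closing_speed_mps <= 0.0:
        return math.inf
    return max(gap_m, 0.0) / closing_speed_mps


def control_tier(ttc_s: float, age_group: str, will_cross: bool) -> str:
    """Pick a control tier from TTC, predicted intent, and age group.

    Assumed tiers for this sketch:
      - "emergency_brake": TTC is below half the age-specific margin,
        regardless of the predicted intent.
      - "yield": the pedestrian is predicted to cross and TTC is below
        the age-specific margin.
      - "proceed": otherwise.
    """
    margin = AGE_MARGIN_S[age_group]
    if ttc_s < 0.5 * margin:
        return "emergency_brake"
    if will_cross and ttc_s < margin:
        return "yield"
    return "proceed"


if __name__ == "__main__":
    ttc = time_to_collision(gap_m=15.0, closing_speed_mps=5.0)  # 3.0 s
    for group in ("child", "adult", "senior"):
        print(group, control_tier(ttc, group, will_cross=True))
```

Under these assumed margins, the same 3.0 s TTC leads the vehicle to yield for a child or a senior but to proceed for an adult, which is the kind of demographic-dependent behavior the abstract describes.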