Citation: http://arxiv.org/pdf/2507.21161v1
This paper, "Seeing Beyond Frames: Zero-Shot Pedestrian Intention Prediction with Raw Temporal Video and Multimodal Cues," focuses on a critical challenge in autonomous driving technology: predicting pedestrian crossing intention. To address this challenge, the study proposes a new zero-shot approach, "BF-PIP," which leverages Google's latest Multimodal Large Language Model (MLLM), Gemini 2.5 Pro.
Research Objective: The main objective of this research is to overcome the problems inherent in conventional intention prediction models, such as the need for large amounts of pre-training data and low adaptability to new environments. To achieve this, the study aims to build a framework that directly and accurately predicts pedestrian intention from continuous video footage and multiple information sources (multimodal cues) through "zero-shot learning," without any additional training.
Research Background: For autonomous vehicles to navigate urban areas safely, it is essential to accurately predict the next actions of pedestrians, especially whether they will cross the road. Previous research has used models like RNNs and Transformers to predict pedestrian movement, but these methods require training on specific datasets and have difficulty responding to unknown situations not present in the training data. In recent years, MLLMs like GPT-4V have emerged, making zero-shot prediction increasingly possible. However, these still process sequences of still images (frame sequences) and may miss the subtle nuances that can only be captured in continuous video, such as a pedestrian's "hesitation" or "gaze movement."
Highlights of the Proposed Method: The key features of the BF-PIP method proposed in this study are illustrated in Figure 1 and discussed below.
Figure 1: Overview diagram of the BF-PIP framework. Multimodal information, such as a short video clip, bounding box, and ego-vehicle speed, is input as a prompt to Gemini 2.5 Pro, which predicts the pedestrian's crossing intention (Crossing/Not Crossing) in a zero-shot manner.
This study takes a step beyond previous MLLM-based methods, with the key difference being that it directly inputs raw continuous video clips into the model. This allows it to capture temporal dynamics, such as hesitation and gaze shifts, that were not fully captured by prior research, enabling predictions based on a more realistic situational awareness.
Compared with existing studies, this research offers a new perspective and makes both academic and practical contributions.
This paper proposes BF-PIP (Beyond Frames Pedestrian Intention Prediction). This method fully leverages the advanced multimodal capabilities of Gemini 2.5 Pro, which can process video, images, and text in a single prompt.
The proposed method packages a short video clip, the pedestrian's bounding box, and the ego-vehicle speed into a single structured prompt and asks Gemini 2.5 Pro for a binary crossing decision, as sketched below.
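To make this concrete, the following is a minimal sketch of how such a zero-shot query could be issued, assuming the google-genai Python SDK. The prompt wording, bounding-box coordinates, speed value, and answer-parsing logic are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (not the authors' released code): a zero-shot crossing-intention
# query with the google-genai Python SDK. The prompt wording, bounding box,
# speed value, and answer-parsing logic are illustrative assumptions.
import time

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Upload a short observation clip of the target pedestrian.
video_file = client.files.upload(file="pedestrian_clip.mp4")

# Uploaded videos are processed asynchronously; wait until the file is ready.
while video_file.state.name == "PROCESSING":
    time.sleep(2)
    video_file = client.files.get(name=video_file.name)

prompt = (
    "You are assisting an autonomous vehicle. The attached clip shows a pedestrian "
    "highlighted by the bounding box [x1=412, y1=180, x2=470, y2=365]. "  # assumed example box
    "The ego-vehicle speed is 28 km/h. Based on the pedestrian's posture, gaze, "  # assumed example speed
    "and motion, answer with exactly one word: 'Crossing' or 'Not Crossing'."
)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[video_file, prompt],
)

# Map the free-form answer to a binary label.
label = "Not Crossing" if "not crossing" in response.text.lower() else "Crossing"
print(response.text.strip(), "->", label)
```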
To verify the effectiveness of the proposed approach, the evaluation used the JAADbeh subset of the JAAD (Joint Attention in Autonomous Driving) dataset, which is widely used in autonomous driving research and focuses specifically on crossing behaviors. The study yielded the following important findings:
Quantitative Results: Despite requiring no additional training, BF-PIP demonstrated exceptionally high performance compared to both existing specialized models and MLLM-based methods.
Models | Year | Model Variants | I | B | P | S | V | Extra Info. | ACC | AUC | F1 | P | R |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MultiRNN [3] | 2018 | GRU | ✓ | ✓ | ✓ | – | – | – | 0.61 | 0.50 | 0.74 | 0.64 | 0.86 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
GPT4V-PBP [15] | 2023 | MLLM | ✓ | ✓ | – | – | – | Text | 0.57 | 0.61 | 0.65 | 0.82 | 0.54 |
OmniPredict [14] | 2024 | MLLM | ✓ | ✓ | – | ✓ | – | Text | 0.67 | 0.65 | 0.65 | 0.66 | 0.65 |
**BF-PIP (Ours)** | 2025 | MLLM | – | ✓ | – | ✓ | ✓ | Text | 0.73 | 0.77 | 0.80 | 0.96 | 0.69 |
Table 1: Performance comparison with existing state-of-the-art methods; all metrics are reported on the JAADbeh subset. BF-PIP (in bold) uses video (V) as a primary input and achieves high performance in accuracy (ACC), AUC, F1-score, and Precision (P).
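For reference, the scores in Tables 1 and 2 correspond to standard binary-classification metrics; the snippet below is a minimal sketch of how they can be computed with scikit-learn. The labels and confidence scores are placeholders, not the paper's data, and how the paper derives scores for AUC from the MLLM's discrete answers is not specified here, so `y_score` is an assumption.

```python
# Sketch of the evaluation metrics reported in Tables 1 and 2 (ACC, AUC, F1, P, R),
# computed with scikit-learn. The labels and scores below are placeholders, not the
# paper's data.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1]               # ground truth: 1 = Crossing, 0 = Not Crossing
y_pred = [1, 0, 1, 0, 0, 1]               # the model's binary decisions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7]  # assumed confidence that the pedestrian will cross

print("ACC:", accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
print("F1 :", f1_score(y_true, y_pred))
print("P  :", precision_score(y_true, y_pred))
print("R  :", recall_score(y_true, y_pred))
```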
Qualitative Results: Analysis of how the model makes its judgments revealed that Gemini 2.5 Pro understands context deeply, much like a human.
Figure 2: Example of qualitative analysis of pedestrian crossing intention. The model captures multiple factors such as the pedestrian's posture (leaning forward), direction of gaze (checking traffic), and subtle movements (a step towards the crosswalk) to make a comprehensive judgment of the crossing intention.
Ablation Study: By varying the types of input information, the study investigated which elements contributed to performance.
Input Modality | ACC | AUC | F1 | P | R |
---|---|---|---|---|---|
UV (Unannotated Video) | 0.65 | 0.62 | 0.74 | 0.96 | 0.60 |
UV + S (+ Speed) | 0.70 | 0.74 | 0.78 | 0.97 | 0.65 |
AV (Annotated Video) | 0.64 | 0.61 | 0.73 | 0.95 | 0.59 |
AV + S (+ Speed) | 0.73 | 0.76 | 0.80 | 0.96 | 0.69 |
Table 2: Ablation study on input modalities. The combination of annotated video (AV) with ego-vehicle speed (S) showed the highest performance.
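The ablation in Table 2 amounts to changing which cues appear in the prompt. Below is a minimal sketch, under assumed naming and prompt wording, of how the four configurations (UV, UV+S, AV, AV+S) could be assembled; interpreting "annotated video" as a clip with the bounding box drawn in is an assumption, not a detail confirmed by the summary above.

```python
# Sketch of assembling the four input configurations ablated in Table 2.
# "AV" (annotated video) is assumed here to mean a clip with the bounding box
# drawn in; "+ S" appends the ego-vehicle speed to the prompt text.
from typing import Optional, Tuple


def build_query(annotated: bool, speed_kmh: Optional[float]) -> Tuple[str, str]:
    """Return (video_path, prompt_text) for one ablation setting (illustrative)."""
    video_path = "clip_with_bbox.mp4" if annotated else "clip_raw.mp4"
    prompt = ("The attached clip shows a pedestrian"
              + (" highlighted by a bounding box" if annotated else "")
              + ". Answer with exactly one word: 'Crossing' or 'Not Crossing'.")
    if speed_kmh is not None:
        prompt += f" The ego-vehicle speed is {speed_kmh:.0f} km/h."
    return video_path, prompt


configs = {
    "UV":     dict(annotated=False, speed_kmh=None),
    "UV + S": dict(annotated=False, speed_kmh=28.0),
    "AV":     dict(annotated=True,  speed_kmh=None),
    "AV + S": dict(annotated=True,  speed_kmh=28.0),
}

for name, cfg in configs.items():
    path, text = build_query(**cfg)
    print(f"[{name}] video={path}\n  prompt={text}\n")
```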
The outcomes of this research are expected to find applications across a variety of fields and business areas.
Although this research has achieved significant success, a future challenge is handling more complex scenarios. Examples include situations with multiple pedestrians present simultaneously, performance verification under adverse conditions like bad weather or nighttime, and optimization of computational costs to ensure real-time performance.
This paper proposed a new framework, "BF-PIP," which utilizes the multimodal capabilities of Gemini 2.5 Pro to predict pedestrian crossing intention from raw continuous video clips in a zero-shot manner. By achieving higher accuracy than existing state-of-the-art methods without any additional training, this study has moved beyond the analysis of static frames and demonstrated the importance of richly capturing temporal context. This achievement is a significant step towards realizing safer and more efficient autonomous driving systems and is expected to have a major impact on future AI development.