Check out our most recent posts.
2025-08-15
This paper proposes a new paradigm, VisionThink, to resolve the trade-off between computational cost and performance in Vision-Language Models (VLMs). Trained with reinforcement learning, the model first processes a low-resolution image and decides for itself whether to request the high-resolution version when needed. It maintains performance on tasks that require fine-grained visual detail, such as OCR, while significantly reducing computational load on general tasks.
2025-08-13
This paper presents a large-scale, systematic comparison of recently prominent Transformer models against conventional mainstream CNN models on object detection tasks using remote sensing data such as satellite imagery. The study evaluates 11 models on three datasets with distinct characteristics, showing that Transformers can outperform CNNs while clarifying the trade-off against their higher training costs.
2025-08-13
This study investigates whether deep learning models can predict a patient's self-reported race from skin histology images, examining the demographic biases that AI may unintentionally learn. Attention analysis reveals that the model relies on specific tissue structures, such as the epidermis, as shortcut cues for predicting race. These findings underscore the importance of careful data curation and bias mitigation for the fair deployment of medical AI.
2025-08-13
This paper proposes a new benchmark, 'Document Haystack,' which measures the ability to retrieve specific information from documents up to 200 pages long. The benchmark evaluates how accurately a Vision-Language Model (VLM) can locate text or image information ('needles') intentionally embedded within a document. The experiments reveal that while current VLMs perform well on text-only documents, their performance degrades significantly on documents rendered as images, or when the target information combines text and images. This highlights open research challenges in the long-context, multimodal document understanding capabilities of VLMs.
2025-08-13
This paper proposes BF-PIP, a zero-shot method that utilizes Google's Gemini 2.5 Pro to predict pedestrian crossing intention without any additional training. Unlike conventional frame-based methods, it directly uses short, continuous videos and metadata such as ego-vehicle speed, achieving a high accuracy of 73% and demonstrating the potential for robust intention prediction based on contextual understanding.
2025-08-12
Anything (formerly Create) is an AI platform that automatically generates web and mobile apps from natural language prompts. This article outlines the process and evaluation of prototyping a home life management app using a real-world specification example.
2025-08-11
Introducing a template that combines the latest Python tools with AI automation to achieve efficient, high-quality development. Accelerate your development with dependency management, code quality assurance, and AI-integrated workflows.
2025-08-10
An experimental article on extracting characters and emotions from the fairy tale 'Little Red Riding Hood' using LangExtract.
2025-08-09
Rork is an AI tool that generates native mobile apps from natural language descriptions and supports building and deploying them to the App Store and Google Play. This article shares the process and impressions of feeding the requirements of a home life management app into Rork and testing the flow of "generation → functional check → store preparation."