2025-08-15
This paper proposes VisionThink, a new paradigm for resolving the trade-off between computational cost and performance in Vision-Language Models (VLMs). The method, trained with reinforcement learning, first processes a low-resolution image and lets the model itself decide whether to request the high-resolution version when necessary. It maintains performance on tasks that require fine-grained visual information, such as OCR, while substantially reducing computational load on general tasks.
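A minimal sketch of the resize-on-demand inference loop described above. The function name `vlm_generate`, the control token, and the 1/2 downscale factor are illustrative assumptions; in the paper, the decision to request high resolution is learned with reinforcement learning rather than hand-coded.

```python
# Sketch of VisionThink-style two-pass inference (assumed interfaces).
from PIL import Image

REQUEST_TOKEN = "<request_high_res>"  # hypothetical special token

def visionthink_infer(vlm_generate, image_path: str, question: str) -> str:
    image = Image.open(image_path)
    low_res = image.resize((image.width // 2, image.height // 2))

    # First pass: answer from the cheap low-resolution view.
    answer = vlm_generate(low_res, question)

    # If the model judges the view insufficient (e.g. dense OCR text),
    # it emits the request token and we run a second, costlier pass.
    if REQUEST_TOKEN in answer:
        answer = vlm_generate(image, question)
    return answer
```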
2025-08-13
This paper presents a large-scale, systematic comparison of Transformer models, which have recently attracted attention, against the conventionally dominant CNN models for object detection on remote sensing data such as satellite imagery. The study evaluates 11 models on three datasets with distinct characteristics, showing that Transformers can outperform CNNs while clarifying the trade-off against their higher training costs.
2025-08-13
This study investigates whether deep learning models can predict a patient's self-reported race from skin histology images, probing the demographic biases that AI may unintentionally learn. Attention analysis reveals that the model uses specific tissue structures, such as the epidermis, as shortcut cues for predicting race. These findings underscore the importance of careful data curation and bias mitigation for deploying medical AI fairly.
2025-08-13
This paper proposes Document Haystack, a new benchmark that measures the ability to locate specific information in long documents of up to 200 pages. The benchmark evaluates how accurately a Vision-Language Model (VLM) can find text or image "needles" intentionally embedded in a document. Experiments show that while current VLMs perform well on text-only documents, their performance degrades significantly on documents rendered as images or when the target information combines text and images, highlighting open challenges in VLMs' long-context, multimodal document understanding.
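An illustrative scoring loop for a needle-retrieval benchmark of this kind. The `ask_vlm` callable and the record fields are assumptions for the sketch, not the benchmark's actual API; Document Haystack's real protocol and metrics may differ.

```python
# Toy needle-in-document evaluation loop (assumed data layout).
def evaluate(cases, ask_vlm):
    """cases: iterable of dicts, each with page images, a question
    targeting one embedded 'needle', and the expected answer string."""
    hits = 0
    for case in cases:
        prediction = ask_vlm(case["pages"], case["question"])
        # Simple containment match on the needle value, case-insensitive.
        if case["needle"].lower() in prediction.lower():
            hits += 1
    return hits / len(cases)
```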
2025-08-13
This paper proposes BF-PIP, a zero-shot method that uses Google's Gemini 2.5 Pro to predict pedestrian crossing intention without any additional training. Unlike conventional frame-based methods, it works directly from short continuous video clips plus metadata such as ego-vehicle speed, reaching 73% accuracy and demonstrating the potential of robust, context-aware intention prediction.
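A minimal sketch of a zero-shot crossing-intention query in this spirit, using the google-generativeai SDK. The file name, speed value, and prompt wording are illustrative assumptions, not the paper's exact protocol.

```python
# Zero-shot video + metadata query (illustrative, not BF-PIP's exact prompt).
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")

# Upload a short continuous clip and wait until it finishes processing.
clip = genai.upload_file("crossing_clip.mp4")
while clip.state.name == "PROCESSING":
    time.sleep(2)
    clip = genai.get_file(clip.name)

prompt = (
    "Ego-vehicle speed: 32 km/h. Given the clip and this metadata, "
    "will the pedestrian cross the street? "
    "Answer 'crossing' or 'not crossing' and justify briefly."
)
response = model.generate_content([clip, prompt])
print(response.text)
```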
2025-08-08
This paper proposes Query-Guided Activation Refilling (ACRE), a novel method for improving the efficiency and performance of long-context processing in Large Language Models (LLMs). By combining a two-layer KV cache with query-guided refilling of activations, the approach handles contexts beyond the model's native window, significantly improving the practicality of long-context information retrieval.
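A conceptual sketch of the two-layer cache idea: a compact first layer holds cheap per-segment summaries, while full KV states sit in a second layer and are refilled only for segments relevant to the current query. The pooling and scoring rules here are simplifications assumed for illustration, not ACRE's exact formulation.

```python
# Simplified two-layer KV cache with query-guided refilling.
import numpy as np

class TwoLayerKVCache:
    def __init__(self, segment_len: int = 128):
        self.segment_len = segment_len
        self.full_segments = []  # layer 2: full per-segment KV tensors
        self.summaries = []      # layer 1: one pooled key vector per segment

    def append(self, keys: np.ndarray, values: np.ndarray):
        """keys/values: (segment_len, d) KV states for one segment."""
        self.full_segments.append((keys, values))
        self.summaries.append(keys.mean(axis=0))  # cheap pooled proxy

    def refill(self, query: np.ndarray, top_k: int = 4):
        """Return full KV only for the segments most relevant to the query."""
        scores = np.array([s @ query for s in self.summaries])
        chosen = np.argsort(scores)[-top_k:]
        return [self.full_segments[i] for i in sorted(chosen)]
```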
2025-08-08
This paper proposes Desiview, a method for automatically identifying desirable review comments (DRCs), i.e., comments that lead to code changes, in code review. By using Desiview to construct a high-quality dataset and then fine-tuning and aligning a LLaMA model on it, the authors demonstrate a significant improvement in DRC generation capability. The approach promises to contribute substantially to code review automation and software development support.
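A toy heuristic for the core labeling idea, flagging a comment as "desirable" when the commented location is touched in the following revision. The field names and the matching rule are assumptions for illustration; Desiview's actual identification method is more sophisticated.

```python
# Toy DRC labeling heuristic (assumed data layout, not Desiview itself).
def label_comments(comments, revisions):
    """comments: [{'id', 'file', 'line', 'patchset'}, ...]
    revisions: {patchset_number: set of (file, line) locations changed
    in the *next* patchset}."""
    labels = {}
    for c in comments:
        changed = revisions.get(c["patchset"], set())
        # Provisionally desirable if the commented spot was then edited.
        labels[c["id"]] = (c["file"], c["line"]) in changed
    return labels
```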
2025-08-08
This paper proposes a hybrid Top-k recommendation system that combines traditional recommendation methods with large language models (LLMs). Users are split into "active users" and "weak users," and the LLM is applied only to the weak group to improve recommendation accuracy and fairness where traditional methods struggle, while capping LLM computational cost to keep the system practical.
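A minimal sketch of that routing policy. The activity threshold and both backends are placeholders assumed for illustration, not the paper's actual components.

```python
# Route recommendation requests by user activity (illustrative).
from dataclasses import dataclass

@dataclass
class User:
    id: int
    num_interactions: int
    profile_text: str

ACTIVITY_THRESHOLD = 20  # assumed cutoff separating active vs. weak users

def recommend_top_k(user: User, k: int, cf_model, llm_recommender):
    """Active users get the cheap traditional model; the LLM budget is
    spent only on weak users, where collaborative signals are sparse."""
    if user.num_interactions >= ACTIVITY_THRESHOLD:
        return cf_model.top_k(user.id, k)
    return llm_recommender.top_k(user.profile_text, k)
```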