2025-08-15
This paper proposes VisionThink, a new paradigm for resolving the trade-off between computational cost and performance in Vision-Language Models (VLMs). The method, trained with reinforcement learning, first processes a low-resolution image and lets the model itself decide whether to request the high-resolution version when necessary. It maintains performance on tasks that require fine-grained visual information, such as OCR, while substantially reducing computational load on general tasks.
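A minimal sketch of the resize-on-demand inference loop described above. The function name `vlm_generate`, the control token, and the 1/2 downscale factor are illustrative assumptions; in the paper, the decision to request high resolution is learned with reinforcement learning rather than hand-coded.

```python
# Sketch of VisionThink-style two-pass inference (assumed interfaces).
from PIL import Image

REQUEST_TOKEN = "<request_high_res>"  # hypothetical special token

def visionthink_infer(vlm_generate, image_path: str, question: str) -> str:
    image = Image.open(image_path)
    low_res = image.resize((image.width // 2, image.height // 2))

    # First pass: answer from the cheap low-resolution view.
    answer = vlm_generate(low_res, question)

    # If the model judges the view insufficient (e.g. dense OCR text),
    # it emits the request token and we run a second, costlier pass.
    if REQUEST_TOKEN in answer:
        answer = vlm_generate(image, question)
    return answer
```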
2025-08-13
This paper presents a large-scale, systematic comparison of Transformer models, which have recently attracted attention, against the conventionally dominant CNN models for object detection on remote sensing data such as satellite imagery. The study evaluates 11 models on three datasets with distinct characteristics, showing that Transformers can outperform CNNs while clarifying the trade-off against their higher training costs.
2025-08-13
This study investigates whether deep learning models can predict a patient's self-reported race from skin histology images, probing the demographic biases that AI may unintentionally learn. Attention analysis reveals that the model uses specific tissue structures, such as the epidermis, as shortcut cues for predicting race. These findings underscore the importance of careful data curation and bias mitigation for deploying medical AI fairly.
2025-08-13
This paper proposes Document Haystack, a new benchmark that measures the ability to locate specific information in long documents of up to 200 pages. The benchmark evaluates how accurately a Vision-Language Model (VLM) can find text or image "needles" intentionally embedded in a document. Experiments show that while current VLMs perform well on text-only documents, their performance degrades significantly on documents rendered as images or when the target information combines text and images, highlighting open challenges in VLMs' long-context, multimodal document understanding.
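An illustrative scoring loop for a needle-retrieval benchmark of this kind. The `ask_vlm` callable and the record fields are assumptions for the sketch, not the benchmark's actual API; Document Haystack's real protocol and metrics may differ.

```python
# Toy needle-in-document evaluation loop (assumed data layout).
def evaluate(cases, ask_vlm):
    """cases: iterable of dicts, each with page images, a question
    targeting one embedded 'needle', and the expected answer string."""
    hits = 0
    for case in cases:
        prediction = ask_vlm(case["pages"], case["question"])
        # Simple containment match on the needle value, case-insensitive.
        if case["needle"].lower() in prediction.lower():
            hits += 1
    return hits / len(cases)
```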
2025-08-13
This paper proposes BF-PIP, a zero-shot method that uses Google's Gemini 2.5 Pro to predict pedestrian crossing intention without any additional training. Unlike conventional frame-based methods, it works directly from short continuous video clips plus metadata such as ego-vehicle speed, reaching 73% accuracy and demonstrating the potential of robust, context-aware intention prediction.
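A minimal sketch of a zero-shot crossing-intention query in this spirit, using the google-generativeai SDK. The file name, speed value, and prompt wording are illustrative assumptions, not the paper's exact protocol.

```python
# Zero-shot video + metadata query (illustrative, not BF-PIP's exact prompt).
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")

# Upload a short continuous clip and wait until it finishes processing.
clip = genai.upload_file("crossing_clip.mp4")
while clip.state.name == "PROCESSING":
    time.sleep(2)
    clip = genai.get_file(clip.name)

prompt = (
    "Ego-vehicle speed: 32 km/h. Given the clip and this metadata, "
    "will the pedestrian cross the street? "
    "Answer 'crossing' or 'not crossing' and justify briefly."
)
response = model.generate_content([clip, prompt])
print(response.text)
```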
2025-08-08
This paper proposes Query-Guided Activation Refilling (ACRE), a novel method for improving the efficiency and performance of long-context processing in Large Language Models (LLMs). By combining a two-layer KV cache with query-guided refilling of activations, the approach handles contexts beyond the model's native window, significantly improving the practicality of long-context information retrieval.
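A conceptual sketch of the two-layer cache idea: a compact first layer holds cheap per-segment summaries, while full KV states sit in a second layer and are refilled only for segments relevant to the current query. The pooling and scoring rules here are simplifications assumed for illustration, not ACRE's exact formulation.

```python
# Simplified two-layer KV cache with query-guided refilling.
import numpy as np

class TwoLayerKVCache:
    def __init__(self, segment_len: int = 128):
        self.segment_len = segment_len
        self.full_segments = []  # layer 2: full per-segment KV tensors
        self.summaries = []      # layer 1: one pooled key vector per segment

    def append(self, keys: np.ndarray, values: np.ndarray):
        """keys/values: (segment_len, d) KV states for one segment."""
        self.full_segments.append((keys, values))
        self.summaries.append(keys.mean(axis=0))  # cheap pooled proxy

    def refill(self, query: np.ndarray, top_k: int = 4):
        """Return full KV only for the segments most relevant to the query."""
        scores = np.array([s @ query for s in self.summaries])
        chosen = np.argsort(scores)[-top_k:]
        return [self.full_segments[i] for i in sorted(chosen)]
```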
2025-08-08
This paper proposes Desiview, a method for automatically identifying desirable review comments (DRCs), i.e., comments that lead to code changes, in code review. By using Desiview to construct a high-quality dataset and then fine-tuning and aligning a LLaMA model on it, the authors demonstrate a significant improvement in DRC generation capability. The approach promises to contribute substantially to code review automation and software development support.
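A toy heuristic for the core labeling idea, flagging a comment as "desirable" when the commented location is touched in the following revision. The field names and the matching rule are assumptions for illustration; Desiview's actual identification method is more sophisticated.

```python
# Toy DRC labeling heuristic (assumed data layout, not Desiview itself).
def label_comments(comments, revisions):
    """comments: [{'id', 'file', 'line', 'patchset'}, ...]
    revisions: {patchset_number: set of (file, line) locations changed
    in the *next* patchset}."""
    labels = {}
    for c in comments:
        changed = revisions.get(c["patchset"], set())
        # Provisionally desirable if the commented spot was then edited.
        labels[c["id"]] = (c["file"], c["line"]) in changed
    return labels
```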
2025-08-08
This paper proposes a hybrid Top-k recommendation system that combines traditional recommendation methods with large language models (LLMs). Users are split into "active users" and "weak users," and the LLM is applied only to the weak group to improve recommendation accuracy and fairness where traditional methods struggle, while capping LLM computational cost to keep the system practical.
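A minimal sketch of that routing policy. The activity threshold and both backends are placeholders assumed for illustration, not the paper's actual components.

```python
# Route recommendation requests by user activity (illustrative).
from dataclasses import dataclass

@dataclass
class User:
    id: int
    num_interactions: int
    profile_text: str

ACTIVITY_THRESHOLD = 20  # assumed cutoff separating active vs. weak users

def recommend_top_k(user: User, k: int, cf_model, llm_recommender):
    """Active users get the cheap traditional model; the LLM budget is
    spent only on weak users, where collaborative signals are sparse."""
    if user.num_interactions >= ACTIVITY_THRESHOLD:
        return cf_model.top_k(user.id, k)
    return llm_recommender.top_k(user.profile_text, k)
```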