2025-08-15
This paper proposes a new paradigm, VisionThink, to address the trade-off between computational cost and performance in Vision-Language Models (VLMs). The method, trained with reinforcement learning, first processes a low-resolution image and lets the model itself decide whether it needs to request the high-resolution version. It maintains performance on tasks that require fine-grained visual detail, such as OCR, while significantly reducing the computational load on general tasks.
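The inference-time control flow described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `REQUEST_HIGH_RES` marker, the `answer_with_escalation` function, the downsampling factor, and the `toy_model` stand-in are all hypothetical, and image sizes stand in for actual images so the example stays self-contained. The reinforcement learning step that teaches the model when to make the request is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Size = Tuple[int, int]  # (width, height); a stand-in for a real image

# Hypothetical marker the model emits when it wants the full-resolution image.
REQUEST_HIGH_RES = "<request_high_res>"


@dataclass
class Result:
    answer: str
    used_high_res: bool


def downsample(size: Size, factor: int = 2) -> Size:
    """Shrink each side by `factor` (assumed value), never below 1 pixel."""
    width, height = size
    return max(1, width // factor), max(1, height // factor)


def answer_with_escalation(
    model: Callable[[str, Size], str],
    question: str,
    full_size: Size,
    factor: int = 2,
) -> Result:
    """Try the cheap low-resolution pass first; escalate only on request."""
    low_size = downsample(full_size, factor)
    first = model(question, low_size)
    if first.strip() != REQUEST_HIGH_RES:
        # The model was satisfied with the low-resolution input.
        return Result(answer=first, used_high_res=False)
    # The model asked for more detail: rerun on the full-resolution input.
    second = model(question, full_size)
    return Result(answer=second, used_high_res=True)


def toy_model(question: str, size: Size) -> str:
    """Hypothetical stand-in: asks for high resolution on text-reading questions."""
    if "text" in question and size[0] < 448:
        return REQUEST_HIGH_RES
    return f"answer at {size[0]}x{size[1]}"


if __name__ == "__main__":
    print(answer_with_escalation(toy_model, "What animal is shown?", (448, 448)))
    print(answer_with_escalation(toy_model, "What text is on the sign?", (448, 448)))
```

In this sketch the general question is answered from the downsampled input alone, while the OCR-style question triggers a second pass at full resolution, which mirrors the cost saving on general tasks that the summary describes.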