Reference: http://arxiv.org/pdf/2507.15882v2
This paper, "Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark," is a study focusing on a significant challenge in the AI field: evaluating the understanding capabilities of Vision Language Models (VLMs) for long documents. This research proposes a new evaluation standard (benchmark) called "Document Haystack" to address this challenge and verifies its effectiveness.
Research Objective: The main objective of this study is to objectively measure how accurately recent VLMs can find specific information from long and visually complex documents, ranging from tens to hundreds of pages. By doing so, it aims to clarify the capabilities and limitations of existing models and to indicate the direction for future research and development.
Research Background: With the advent of VLMs like GPT-4 and Gemini, AI has become capable of handling complex tasks that combine images and text. In particular, the ability to understand specialized documents such as contracts, financial reports, and medical records holds the potential to dramatically improve business efficiency in many fields. However, many existing evaluation metrics focus on short documents or single tasks, and it was not well understood how accurately VLMs could process the "long and complex documents" they would encounter in the real world.
Highlights of the Proposed Method: The most significant feature of the "Document Haystack" proposed in this study is its application of the "Needle in a Haystack" concept to long, multimodal documents. It tests the information retrieval capabilities of a VLM by embedding specific information (the needle) within a document of up to 200 pages (the haystack) and having the VLM find it. This "needle" comes in two types—one with only text and one combining text and an image—allowing for a multifaceted evaluation of the model's abilities.
Figure 1: An example of a "text needle." The text "The secret sport is 'Basketball'" is embedded within a document page.
Figure 2: An example of a "text+image needle." Following the text "The secret sport is," the answer, "Basketball," is shown as an image.
The evaluation of VLMs has been conducted from various perspectives.
Existing benchmarks had several gaps: most cover only short documents or single tasks, few preserve a document's original visual format, and they do not allow performance to be compared across document lengths. "Document Haystack" overcomes these challenges by handling extensive documents of up to 200 pages, providing formats close to the original document (PDFs or page-by-page images), and enabling performance comparison while varying the document length.
This research offers a new perspective compared to existing studies: it evaluates the retrieval of both purely textual information (the text needle) and multimodal information combining text and images (the text+image needle). Its academic and practical contributions include a large-scale, reproducible benchmark (400 document variants and 8,250 questions) and an objective measurement of how the performance of current VLMs degrades as visual documents grow longer.
This paper proposes a new benchmark, "Document Haystack," to measure the long-document reading comprehension ability of VLMs. This method tests a simple yet fundamental capability: finding a single piece of important information (the needle) from a vast amount of information (the haystack).
The proposed benchmark is constructed through the following steps, with its overall composition summarized in Table 1:
| # Pages | 5 | 10 | 25 | 50 | 75 | 100 | 150 | 200 | Total |
|---|---|---|---|---|---|---|---|---|---|
| text needles | | | | | | | | | |
| # Documents | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 200 |
| # Questions | 125 | 250 | 625 | 625 | 625 | 625 | 625 | 625 | 4125 |
| text+image needles | | | | | | | | | |
| # Documents | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 200 |
| # Questions | 125 | 250 | 625 | 625 | 625 | 625 | 625 | 625 | 4125 |
| Total | | | | | | | | | |
| Total # Documents | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 400 |
| Total # Questions | 250 | 500 | 1250 | 1250 | 1250 | 1250 | 1250 | 1250 | 8250 |
Table 1: Composition of Document Haystack. The number of documents and questions are defined for each document length and needle type, totaling 400 document variations and 8250 questions.
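As a quick sanity check, the totals in Table 1 can be reproduced with a few lines of arithmetic. The question-generation rule below (one question per needle depth, with depths capped at 25) is an assumption that is consistent with the table, not a detail confirmed by the paper:

```python
# Reproducing the Table 1 totals (illustrative sketch).
PAGE_COUNTS = [5, 10, 25, 50, 75, 100, 150, 200]
DOCS_PER_LENGTH = 25      # documents per page count, per needle type
NEEDLE_TYPES = 2          # text needles, text+image needles

# Assumed rule: one question per needle depth, capped at 25 depths.
questions_per_doc = {p: min(p, 25) for p in PAGE_COUNTS}

total_docs = DOCS_PER_LENGTH * len(PAGE_COUNTS) * NEEDLE_TYPES
total_questions = NEEDLE_TYPES * DOCS_PER_LENGTH * sum(questions_per_doc.values())

print(total_docs, total_questions)  # -> 400 8250
```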
"Needle" Design and Embedding: Two types of "needles" are embedded at various depths (page positions) within the documents.
Evaluation Method: The VLM is asked, "What is the secret KEY in this document?", where KEY is the embedded attribute (e.g., "sport"). The model's response is automatically checked to see if it contains the correct VALUE (e.g., "Basketball"), and accuracy is calculated.
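In code, the scoring loop might look like the following sketch; `query_vlm` is a hypothetical stand-in for whatever VLM API is being evaluated, and the substring match mirrors the automatic check described above:

```python
def score_responses(cases, query_vlm) -> float:
    """cases: iterable of (document_pages, key, value) triples."""
    hits = 0
    total = 0
    for pages, key, value in cases:
        prompt = f"What is the secret {key} in this document?"
        answer = query_vlm(pages=pages, prompt=prompt)
        hits += value.lower() in answer.lower()  # simple containment check
        total += 1
    return 100.0 * hits / total  # accuracy in %
```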
To validate the effectiveness of the proposed benchmark and measure the capabilities of current VLMs, the following evaluations were conducted:
The study yielded the following important findings:
| Model | 5 | 10 | 25 | 50 | 75 | 100 | 150 | 200 |
|---|---|---|---|---|---|---|---|---|
| Nova Lite | 100.0 | 98.8 | 85.0 | 76.6 | 72.5 | 69.6 | 64.5 | 62.9 |
| Gemini Flash-2.0 | 83.2 | 74.8 | 82.7 | 64.0 | 63.2 | 58.4 | 46.9 | 51.8 |
| GPT-4o-mini | 96.0 | 98.0 | 89.3 | 86.1 | - | - | - | - |
Table 5: Accuracy (%) for text extraction from images, by number of pages. Accuracy decreases as documents get longer; Nova Lite and GPT-4o-mini show high performance.
| Model | 5 | 10 | 25 | 50 | 75 | 100 | 150 | 200 |
|---|---|---|---|---|---|---|---|---|
| Nova Lite | 100.0 | 100.0 | 98.9 | 95.2 | 94.6 | 93.9 | 94.1 | 89.9 |
| Gemini Flash-2.0 | 99.2 | 99.6 | 99.5 | 97.8 | 96.8 | 97.1 | 91.5 | 91.8 |
| GPT-4o-mini | 100.0 | 100.0 | 97.9 | 98.4 | 96.6 | 97.5 | - | - |
Table 6: Accuracy (%) for text extraction from parsed text, by number of pages. All models show very high performance.
| Model | 5 | 10 | 25 | 50 | 75 | 100 | 150 | 200 |
|---|---|---|---|---|---|---|---|---|
| Nova Lite | 84.0 | 84.0 | 61.4 | 52.2 | 43.5 | 38.9 | 34.9 | 37.0 |
| Gemini Flash-2.0 | 53.6 | 52.0 | 67.4 | 56.8 | 48.6 | 43.5 | 37.9 | 38.7 |
| GPT-4o-mini | 43.2 | 36.4 | 39.4 | 26.9 | - | - | - | - |
Table 7: Accuracy (%) for text + image extraction from images, by number of pages. Performance drops significantly for all models, showing the difficulty of multimodal understanding.
From these findings, it was confirmed that the proposed method is effective in clearly highlighting the current strengths and weaknesses of VLMs in long-context, multimodal understanding.
The results of this study and the proposed benchmark are expected to find applications both in research and in business areas that handle long, visually complex documents, such as contract review, financial reporting, and medical records.
Future research will need to address the important challenge of improving the ability of VLMs to maintain visual information over long contexts. Specifically, this will require the development of more efficient attention mechanisms and architectures that can integrate text and image information at a higher level.
This paper proposed "Document Haystack," a new and comprehensive benchmark for evaluating the long-context, multimodal document understanding capabilities of VLMs. Evaluations using this benchmark have revealed that even today's state-of-the-art VLMs face significant challenges in processing long visual documents. This research is expected to contribute significantly to the advancement of VLM research and to be a crucial step toward realizing more practical document-understanding AI.