Citation: http://arxiv.org/pdf/2508.02871v1
This paper, "Evaluation and Analysis of Deep Neural Transformers and Convolutional Neural Networks on Modern Remote Sensing Datasets," is a study focusing on object detection from satellite images, a critical task in the field of remote sensing. This research systematically compares and analyzes the performance of the Transformer architecture, which has recently garnered significant attention in the computer vision field, and the Convolutional Neural Network (CNN), the conventional standard technology, within the specialized domain of satellite imagery.
Research Objective: The primary goal of this research is to clarify which architecture, Transformer or CNN, is better suited to object detection in satellite imagery and how the two differ in behavior. It also aims to characterize the trade-off between performance and computational cost (training time), providing practical guidance for researchers and developers in the remote sensing field when selecting models for their objectives.
Research Background: Since the advent of AlexNet in 2012, CNNs have dominated image recognition. In recent years, however, the Transformer architecture, which originated in natural language processing, has been applied to image recognition (e.g., the Vision Transformer) and has begun to surpass CNNs in many tasks. Most of these results, though, were obtained on general ground-level photo datasets. Satellite images have characteristics that differ from ground-level photos, such as a top-down perspective, widely varying object scales, and unusual object arrangements, so it was unclear whether Transformers would show the same superiority there. This study conducted a large-scale comparative experiment to fill that gap.
Highlights of the Study: This study's value lies not in proposing a specific new technology but in its thorough comparative evaluation. The highlights are as follows:
Figure 1: Sample images from the three datasets used for evaluation in this study. From top to bottom: RarePlanes (aircraft), DOTA (diverse objects), and xView (high-density objects). It is clear that the type, size, and density of objects differ significantly across datasets.
In this study, representative models from the two major architectures in object detection, CNN and Transformer, were selected.
CNN-based Models (6 types):
Transformer-based Models (5 types):
Detector | Type | Backbone | Parameters (M) | AP (COCO) | Release Year |
---|---|---|---|---|---|
ConvNeXt | Two-Stage CNN | ConvNeXt-S | 67.09 | 51.81 | 2022 |
SSD | Single-Stage CNN | VGG-16 | 36.04 | 29.5 | 2016 |
RetinaNet | Single-Stage CNN | ResNeXt-101 | 95.47 | 41.6 | 2017 |
FCOS | Single-Stage CNN | ResNeXt-101 | 89.79 | 42.6 | 2019 |
YOLOv3 | Single-Stage CNN | DarkNet-53 | 61.95 | 33.7 | 2018 |
YOLOX | Single-Stage CNN | YOLOX-X | 99.07 | 50.9 | 2021 |
ViT | Transformer | ViT-B | 97.62 | N/A | 2020 |
DETR | Transformer | ResNet-50 | 41.30 | 40.1 | 2020 |
Deformable DETR | Transformer | ResNet-50 | 40.94 | 46.8 | 2020 |
SWIN | Transformer | SWIN-T | 45.15 | 46.0 | 2021 |
CO-DETR | Transformer | SWIN-L | 218.00 | 64.1 | 2023 |
Table 1: A comparison of the detection methods investigated in this study, summarizing their type, backbone (feature extractor), number of parameters, performance on the COCO dataset (AP), and release year.
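As a rough illustration of where the parameter counts in Table 1 come from, the sketch below instantiates an off-the-shelf detector and counts its parameters. This is a minimal sketch under assumptions: torchvision's `retinanet_resnet50_fpn` is used as a stand-in, whereas the paper's RetinaNet uses a ResNeXt-101 backbone, so the resulting count will not match the 95.47 M in Table 1.

```python
# Minimal sketch: counting detector parameters with torchvision (an assumption;
# the paper's models are not necessarily built from torchvision). Uses a
# ResNet-50 RetinaNet as a stand-in, so the count differs from Table 1.
import torchvision

# weights=None / weights_backbone=None (recent torchvision) avoids any download.
model = torchvision.models.detection.retinanet_resnet50_fpn(
    weights=None, weights_backbone=None
)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters:     {total_params / 1e6:.2f} M")
print(f"Trainable parameters: {trainable_params / 1e6:.2f} M")
```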
Unlike prior research that develops individual models, this study evaluates and analyzes these diverse models side by side in the specific application domain of remote sensing. This cross-cutting approach clarifies how each model behaves on challenges unique to satellite imagery that might not be apparent on general datasets.
This study offers a new perspective compared to existing research in the following ways:
This study makes the following academic and practical contributions:
The following evaluation methods were adopted to compare the models under study.
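The metrics reported below (F1, AP, AP50, AR, AR50) follow standard COCO-style detection evaluation. As a minimal sketch, assuming pycocotools (the paper's exact evaluation tooling is not specified in this summary, and the file paths are placeholders), such metrics can be computed as follows:

```python
# Sketch of COCO-style evaluation with pycocotools (an assumption; the paper's
# exact tooling is not stated here). Paths below are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val.json")       # ground-truth annotations
coco_dt = coco_gt.loadRes("results/detections.json")   # model detections

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, AR, etc.

ap, ap50 = evaluator.stats[0], evaluator.stats[1]
print(f"AP={ap:.4f}, AP50={ap50:.4f}")
```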
The study yielded the following important findings.
1. Transformers Are Also Strong in Remote Sensing
Across all datasets, the best-performing models were Transformer-based.
Model | Parameters (M) | F1 (%) | AP (%) | AP50 (%) | AR (%) | AR50 (%) |
---|---|---|---|---|---|---|
SWIN | 45 | 81.70 | 59.04 | 73.71 | 61.94 | 74.47 |
YOLOX | 99 | 77.14 | 54.84 | 66.27 | 58.22 | 68.71 |
CO-DETR | 218 | 70.71 | 56.60 | 67.95 | 79.74 | 97.59 |
Table 3 (Partial Excerpt): Performance on the RarePlanes dataset. SWIN Transformer achieved the highest F1 score, while CO-DETR missed very few objects (AR50 of 97.59%).
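A brief note on reading the recall column: AR50 is average recall at an IoU threshold of 0.5, so CO-DETR's 97.59% implies that only about 2.4% of ground-truth objects go undetected at that threshold. A tiny illustrative arithmetic sketch:

```python
# Illustrative arithmetic only: relating recall at IoU 0.5 to missed detections.
ar50 = 0.9759  # CO-DETR's AR50 on RarePlanes, from Table 3
missed_fraction = 1.0 - ar50
print(f"Approximate fraction of ground-truth objects missed: {missed_fraction:.2%}")
# -> roughly 2.41%
```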
2. A Trade-off Exists Between Performance and Computational Cost
A trend was observed in which models that deliver higher performance require more training time.
Figure 3: Comparison of F1 score (blue) and training speed (orange, FPS) on the DOTA dataset. A clear trade-off is shown: high-performance models (e.g., CO-DETR) are slow, while fast models (e.g., YOLOv3) have slightly lower performance.
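One way to reason about this trade-off is to keep only the models that are not dominated on both axes (higher F1 and higher training throughput). The sketch below computes such a Pareto set; the (F1, FPS) values are hypothetical placeholders for illustration, not figures reported in the paper.

```python
# Sketch: selecting Pareto-optimal models on the F1 vs. training-throughput trade-off.
# The numbers below are hypothetical placeholders, NOT values from the paper.
candidates = {
    "CO-DETR": (0.80, 2.0),   # (F1, training FPS) -- illustrative only
    "SWIN":    (0.78, 5.0),
    "YOLOX":   (0.74, 9.0),
    "YOLOv3":  (0.68, 14.0),
    "FCOS":    (0.66, 8.0),   # dominated by YOLOX in this made-up example
}

def pareto_front(models):
    """Keep models for which no other model is at least as good on both axes and better on one."""
    front = {}
    for name, (f1, fps) in models.items():
        dominated = any(
            other_f1 >= f1 and other_fps >= fps and (other_f1, other_fps) != (f1, fps)
            for other_f1, other_fps in models.values()
        )
        if not dominated:
            front[name] = (f1, fps)
    return front

print(pareto_front(candidates))
```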
3. Transformers Demonstrate More Stable Performance
While some CNN models showed significant performance fluctuations depending on the dataset (e.g., FCOS), Transformer models tended to exhibit relatively stable performance. In particular, SWIN, YOLOX, and CO-DETR consistently maintained top-class performance across all datasets.
4. Case Study: CNN Backbone vs. Transformer Backbone
A comparison was made using the same detection algorithm (RetinaNet) but swapping the backbone (feature extractor) between a CNN (ResNeXt-101) and a Transformer (ViT).
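The idea behind this case study is that the detection head stays fixed while only the feature extractor changes. The toy sketch below illustrates that structure with one shared head fed by either a small CNN backbone or a small Transformer-encoder backbone; it is a conceptual illustration only, not the paper's RetinaNet/ResNeXt-101/ViT implementation.

```python
# Conceptual sketch only: one detection head, two interchangeable backbones.
# This is NOT the paper's RetinaNet/ResNeXt-101/ViT setup, just the structural idea.
import torch
import torch.nn as nn

class CNNBackbone(nn.Module):
    """Tiny CNN producing a CxHxW feature map."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, out_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class TransformerBackbone(nn.Module):
    """Tiny ViT-style backbone: patchify, encode, reshape back to a feature map."""
    def __init__(self, out_channels=64, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, out_channels, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=out_channels, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
    def forward(self, x):
        feats = self.patch_embed(x)                  # B, C, H', W'
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)    # B, H'*W', C
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class DetectionHead(nn.Module):
    """Shared head predicting class scores and box offsets per spatial location."""
    def __init__(self, in_channels=64, num_classes=2):
        super().__init__()
        self.cls = nn.Conv2d(in_channels, num_classes, 1)
        self.box = nn.Conv2d(in_channels, 4, 1)
    def forward(self, feats):
        return self.cls(feats), self.box(feats)

# Same head, different backbones: only the feature extractor is swapped.
head = DetectionHead()
image = torch.randn(1, 3, 256, 256)
for backbone in (CNNBackbone(), TransformerBackbone()):
    cls_logits, box_regs = head(backbone(image))
    print(type(backbone).__name__, cls_logits.shape, box_regs.shape)
```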
The results of this research are expected to have applications in various practical fields that utilize satellite imagery.
The findings of this research are expected to be utilized in the following business areas:
While this study has made significant contributions, the following challenges are envisioned for further advancement:
This paper has clearly demonstrated through extensive and systematic experiments that in object detection tasks for remote sensing (satellite imagery), the Transformer architecture has the potential to surpass the performance of conventional CNNs.
In particular, models such as SWIN Transformer, YOLOX, and CO-DETR were found to deliver consistently high performance regardless of dataset characteristics. At the same time, this high performance comes at the computational cost of longer training, so model selection based on the trade-off between performance and cost is essential for practical deployment.
This study is an important milestone that indicates the future direction of next-generation object detection technology in the remote sensing field, providing valuable insights and resources (pre-trained models) that will accelerate research and development in this area.