Citation: http://arxiv.org/pdf/2507.21912v2
This paper, "Predicting patient self-reported race from skin histopathology images with deep learning," is a study focusing on a critical issue in the field of computational pathology[Note 1]: unintentional bias learning in AI models. The research examines whether a deep learning model can predict a patient's self-reported race from digital images of pathological tissue and clarifies what morphological "shortcuts"[Note 2] are used in making those predictions.
Research Objective: The main objective of this study is to determine whether a deep learning model can identify race from skin histology images and, if so, to identify the biological features on which its predictions are based. The aim is to provide insights for prospectively evaluating and mitigating the risk that medical AI, when applied in clinical settings, unintentionally disadvantages specific populations because of hidden biases.
Research Background: While AI, particularly deep learning, has shown remarkable success in disease detection and prognosis prediction, it has also been pointed out that such models can learn biases present in training data, potentially perpetuating or amplifying existing healthcare disparities. In medical imaging such as X-rays, AI has been reported to predict race with high accuracy from features imperceptible to experts, sparking significant debate. However, it remained largely unknown whether similar predictions are possible from histopathology images, which capture structures at the cellular level. Skin is particularly interesting because its macroscopic appearance (skin color) is associated with race, yet this difference becomes far less obvious in stained tissue sections, raising the question of what cues an AI might exploit.
Highlights of the Proposed Method: The most crucial aspect of the approach proposed in this study is the use of an AI model equipped with an attention mechanism[Note 3] to visualize "where" in the image the model is focusing when predicting race. As a result, they discovered that the AI uses a specific tissue structure, the "epidermis," as a strong cue to predict race. This is a groundbreaking achievement that concretely demonstrates the risk of AI learning biological features correlated with race as a "shortcut," rather than the disease itself.
While previous research primarily focused on race prediction in radiological images or technical biases in pathology images, this study is unique in that it narrows its focus to the specific field of dermatopathology and attempts to identify the biological and morphological cues (shortcuts) that enable race prediction. It does not simply report that "the model could predict race" but delves deeper into the basis of its judgment through attention analysis and UMAP visualization.
In these respects, the study offers a new perspective relative to existing research; its demonstration and dissection of race-prediction shortcuts in dermatopathology constitute both an academic contribution and a practical guide for evaluating medical AI.
This paper uses a pipeline that combines a Foundation Model (FM) and Attention-based Multiple Instance Learning (AB-MIL) to classify race from skin histology images.
The proposed method consists of the following steps: (1) whole-slide skin histology images are divided into small tiles; (2) each tile is encoded into a feature vector by a pretrained foundation model encoder (e.g., UNI); (3) the tile features are aggregated into a slide-level representation by attention-based multiple instance learning (AB-MIL), which assigns each tile a learned attention weight; and (4) the aggregated representation is classified into one of the self-reported race categories. A minimal sketch of the AB-MIL aggregation step is shown below.
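The paper's implementation is not included in this summary; the following is a minimal PyTorch sketch of an attention-based MIL head over precomputed tile embeddings. The feature dimension, hidden size, and class count are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    """Minimal attention-based MIL head over precomputed tile embeddings.

    Each slide is treated as a bag of N tile feature vectors (e.g., from a
    foundation-model encoder such as UNI); dimensions here are illustrative.
    """

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 256, n_classes: int = 5):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),          # one attention logit per tile
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, tiles: torch.Tensor):
        # tiles: (N, feat_dim) embeddings for one slide
        attn = torch.softmax(self.attention(tiles), dim=0)   # (N, 1), sums to 1 over tiles
        slide_embedding = (attn * tiles).sum(dim=0)          # attention-weighted average
        logits = self.classifier(slide_embedding)            # (n_classes,)
        return logits, attn.squeeze(-1)                      # keep attention for inspection

# Example: one slide represented by 500 tile embeddings
model = ABMIL()
tile_features = torch.randn(500, 1024)
logits, attention_scores = model(tile_features)
```

The per-tile attention scores returned here are what the paper-style analyses (UMAP of high-attention tiles, epidermis ablation) would inspect.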
The results showed that the AI can predict race with moderate, above-chance accuracy and that this prediction is strongly influenced by biases present in the data.
| Experiment | Encoder | White | Black | Hispanic | Asian | Other | Overall AUC | Overall Acc. |
|---|---|---|---|---|---|---|---|---|
| Exp1 (Unadjusted) | UNI | 0.797 | 0.791 | 0.607 | 0.791 | 0.603 | 0.718 | 0.400 |
| Exp1 (Unadjusted) | Avg | 0.789 | 0.770 | 0.596 | 0.795 | 0.563 | 0.702 | 0.394 |
| Exp2 (Disease-balanced) | UNI | 0.760 | 0.773 | 0.560 | 0.715 | 0.569 | 0.676 | 0.380 |
| Exp2 (Disease-balanced) | Avg | 0.742 | 0.754 | 0.560 | 0.724 | 0.574 | 0.671 | 0.364 |
| Exp3 (Strict ICD code) | UNI | 0.819 | 0.766 | 0.654 | 0.556 | 0.594 | 0.678 | 0.296 |
| Exp3 (Strict ICD code) | Avg | 0.799 | 0.762 | 0.640 | 0.570 | 0.543 | 0.663 | 0.302 |
Table 1: Model performance across three dataset curation strategies. AUC is calculated using a One-vs-Rest approach.
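The paper's evaluation code is not reproduced here, but per-class one-vs-rest AUCs and an overall score can be computed with scikit-learn. The sketch below assumes the "Overall AUC" column is a macro average of the per-class one-vs-rest AUCs, and the labels and probabilities are random placeholders rather than real data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

classes = ["White", "Black", "Hispanic", "Asian", "Other"]

# Placeholder per-slide labels and softmax probabilities (not real data)
y_true = np.array([0, 1, 3, 2, 4, 0, 1])                    # indices into `classes`
y_prob = np.random.dirichlet(np.ones(len(classes)), size=len(y_true))

# Per-class one-vs-rest AUC: each class against all others
per_class_auc = {
    c: roc_auc_score((y_true == i).astype(int), y_prob[:, i])
    for i, c in enumerate(classes)
}

# Macro-averaged one-vs-rest AUC and top-1 accuracy, as in the "Overall" columns
overall_auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
overall_acc = accuracy_score(y_true, y_prob.argmax(axis=1))
```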
Finding 1 (Impact of Bias): When the dataset was balanced by disease (Exp2) or restricted by strict ICD code (Exp3), overall AUC dropped (e.g., from 0.718 to 0.676-0.678 for the UNI encoder; Table 1), indicating that part of the model's ability to predict race comes from differences in disease distribution across racial groups rather than from tissue morphology alone.
Finding 2 (Identification of Shortcuts): Even after these adjustments, prediction remained above chance, and attention-based analyses traced the remaining signal to specific tissue structures, as described below.
UMAP Visualization: When the regions the model focused on during prediction (high-attention regions) were visualized with UMAP, it was found that attention was concentrated in regions corresponding to the "epidermis," especially for the White and Black groups.
Figure 1: UMAP visualization of attention scores. (A) shows the high-attention regions (top 10%) for each racial group with contour lines, indicating that attention is concentrated in specific areas for White and Black groups. (B) through (D) show that high-attention regions are associated with specific tissue structures such as the "epidermis."
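As a rough illustration (not the authors' code), this kind of view can be produced with the umap-learn package from tile embeddings and their AB-MIL attention scores; the data below are random placeholders standing in for the model's outputs.

```python
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

# Placeholder inputs; in practice these come from the trained AB-MIL model
rng = np.random.default_rng(0)
tile_embeddings = rng.normal(size=(5000, 1024))   # pooled tile embeddings
attention_scores = rng.random(5000)               # per-tile attention scores

# Project tile embeddings to 2D
coords = umap.UMAP(n_components=2, random_state=0).fit_transform(tile_embeddings)

# Highlight the top 10% highest-attention tiles, mirroring the paper's contour plots
is_high = attention_scores >= np.quantile(attention_scores, 0.90)

plt.scatter(coords[~is_high, 0], coords[~is_high, 1], s=2, c="lightgray", label="other tiles")
plt.scatter(coords[is_high, 0], coords[is_high, 1], s=4, c="crimson", label="top-10% attention")
plt.legend()
plt.title("UMAP of tile embeddings with high-attention tiles highlighted")
plt.show()
```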
Ablation Study: To confirm how crucial the epidermis was for prediction, an experiment was conducted where tiles from the epidermal region were intentionally removed from the validation data.
Figure 2: Attention and ablation analysis. (A) compares attention scores for epidermal and non-epidermal regions, showing that the epidermis receives higher attention in many groups. (B) shows the results of the ablation experiment, where removing epidermal tiles (orange) significantly drops performance from the original (green), and conversely, keeping only epidermal tiles (blue) maintains performance.
The results confirmed that removing the epidermal region significantly degraded the model's performance, while performance was largely maintained when only the epidermal region was kept. This is strong evidence that the AI model uses the morphological features of the epidermis as a powerful shortcut for race prediction.
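A minimal sketch of such a tile-ablation evaluation at inference time is shown below, assuming a trained AB-MIL model like the one sketched earlier and a per-tile boolean epidermis mask; the mask source and all variable names are illustrative assumptions.

```python
import torch

@torch.no_grad()
def predict_with_ablation(model, tile_features, is_epidermis, mode="full"):
    """Run a trained AB-MIL model on a filtered bag of tiles.

    mode: "full"           -> use all tiles (original setting)
          "drop_epidermis" -> remove epidermal tiles before pooling
          "only_epidermis" -> keep only epidermal tiles
    """
    if mode == "drop_epidermis":
        tile_features = tile_features[~is_epidermis]
    elif mode == "only_epidermis":
        tile_features = tile_features[is_epidermis]
    logits, _ = model(tile_features)
    return logits.softmax(dim=-1)

# Example using the ABMIL sketch from earlier and a made-up epidermis mask
model = ABMIL()                                   # stands in for a trained model
tiles = torch.randn(500, 1024)                    # one slide's tile embeddings
epidermis_mask = torch.zeros(500, dtype=torch.bool)
epidermis_mask[:60] = True                        # pretend the first 60 tiles are epidermis
probs_full = predict_with_ablation(model, tiles, epidermis_mask, mode="full")
probs_no_epi = predict_with_ablation(model, tiles, epidermis_mask, mode="drop_epidermis")
```

Comparing metrics across the three modes over the validation set corresponds to the green/orange/blue curves described in Figure 2(B).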
The findings of this research are not intended for direct application in products or services but are extremely important as "guidelines" for developing and evaluating medical AI.
This research suggests new market opportunities for developing more reliable and fair AI.
While this study has provided important insights, several challenges remain.
This paper revealed that a deep learning model can predict a patient's self-reported race from skin histopathology images with moderate accuracy. It determined that this prediction is likely based on leveraging dataset biases, such as disease distribution, and using morphological features of tissues like the "epidermis" as a "shortcut."
These findings strongly suggest the necessity of carefully considering demographic biases when developing and evaluating AI models in computational pathology. To achieve fair and reliable medical AI, it is essential to constantly verify that the model is learning the intrinsic features of a disease and to make efforts to reduce the risk of relying on unintentional shortcuts.