Preprint · Under review

MedVIGIL: Evaluating Trustworthy Medical VLMs
Under Broken Visual Evidence

A clinician-supervised benchmark that measures whether a medical vision–language model recognises when the visual evidence contract has failed — and refuses safely instead of fabricating a fluent unsupported answer.

Hanqi Jiang1,2, Junhao Chen1, Yi Pan1,2, Lifeng Chen1, Weihang You1, Haozhen Gong3, Ruiyu Yan4, Jinglei Lv5, Lin Zhao6, Hui Ren2, Quanzheng Li2, Tianming Liu1, Xiang Li2
1University of Georgia   2Harvard Medical School   3Nanyang Technological University
4New York University   5University of Sydney   6New Jersey Institute of Technology
✉ corresponding author: Xiang Li

The Failure Mode We Measure

Medical VLMs are usually evaluated on intact image–question pairs. Trustworthy clinical use requires a stronger property: the model must recognise when the evidential basis has broken — a missing region of interest, a false-premise question, or a laterality flip — and refuse instead of fabricating a fluent answer.

Two contrasting model behaviours on a perturbed chest X-ray.
A vision-required chest X-ray paired with two evidence perturbations (ROI mask, laterality flip) yields contrasting behaviours: a silent failure that fabricates a left-apical-pneumothorax measurement versus a safe refusal that recognises the broken evidence. MedVIGIL measures this gap.

Abstract

Medical vision–language models (VLMs) are usually evaluated on intact image–question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer.

We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources in which every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is authored by board-certified radiologists. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a fourth radiologist, independent of construction, answers every probe to provide the human reference baseline. The release contains 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at a silent-failure rate of 5.8%, leaving 14.1 points of composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2).

Key Findings

83.3
Independent radiologist MCS
Board-certified attending radiologist R4, recruited from a separate institution and blind to construction.
+14.1
Composite headroom
R4 sits above every audited model on every reported axis except language-prior accuracy (LPA).
+44.8 pp
Largest model–human gap
ROI-masked safe refusal: R4 selects option E on 86.5% of cases; the strongest model on 41.7%.
68.9%
L5 trap silent-failure
GPT-4o's silent-failure rate on L5 ("don't-miss") trap probes; this harm-weighted gap drives the MCS rank inversion.

Benchmark Architecture

MedVIGIL benchmark architecture overview
(A) doctor-authored evidence contract; (B) text-side and image-side contract-perturbation operators; (C) the response manifold scored by seven reported metrics aggregated into MCS.

Four-radiologist construction-and-evaluation pipeline

Every gold answer, refusal option, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Three radiologists construct the dataset; a separate fourth radiologist provides the human reference baseline.

R1 — attending radiologist · parallel annotation
R2 — attending radiologist · parallel annotation
R3 — senior consolidating radiologist · adjudication
R4 — independent fourth radiologist · construction-blind baseline

Six probe families × five clinical risk tiers

Original
Clean image–question pair as a capability control.
Hallucination trap
False-premise rewrite whose only correct option is the doctor-defined refusal letter.
Paraphrase / T-CF
Wording change that preserves the gold; tests paraphrase-robustness as accuracy, not consistency.
Negation
Inverts the question; gold flips deterministically.
Specificity-drop
Removes a clinical qualifier; gold is preserved.
Knowledge-only
Image-independent rewrite; bounds the language-prior contribution.
ROI-only / ROI-masked
Two image-side variants whose contrast (VGR) measures visual grounding.
Laterality flip
Mirror image; gold flips only on laterality-dependent cases.
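
The probe families above expand deterministically from each clinician-authored manifest case. A minimal sketch of what that expansion could look like, assuming illustrative field names (trap_question, refusal_letter, negated_gold, and so on) rather than the released schema:

def expand_case(case: dict) -> list[dict]:
    """Deterministically expand one manifest case into its MCQ probes.

    Field names are illustrative; the released manifest defines the schema.
    """
    probes = [dict(case, probe="original")]  # clean capability control

    # False-premise trap: the doctor-defined refusal letter is the only gold.
    probes.append(dict(case, probe="hallucination_trap",
                       question=case["trap_question"],
                       gold=case["refusal_letter"]))

    # Wording perturbations: paraphrase and specificity-drop preserve the
    # gold answer; negation flips it deterministically.
    probes.append(dict(case, probe="paraphrase", question=case["paraphrase"]))
    probes.append(dict(case, probe="specificity_drop",
                       question=case["underspecified_question"]))
    probes.append(dict(case, probe="negation",
                       question=case["negated_question"],
                       gold=case["negated_gold"]))

    # Knowledge-only rewrite: image-independent, bounds the language prior.
    probes.append(dict(case, probe="knowledge_only",
                       question=case["knowledge_only_question"], image=None))

    # Image-side operators reuse the original question; the gold flips on
    # laterality flip only when the case is laterality-dependent.
    for variant in ("roi_only", "roi_masked", "laterality_flip"):
        flips = variant == "laterality_flip" and case["laterality_dependent"]
        probes.append(dict(case, probe=variant,
                           image=f"{case['case_id']}_{variant}.png",
                           gold=case["flipped_gold"] if flips else case["gold"]))
    return probes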

Audit Across 16 Frontier Medical & General VLMs

We audit 16 vision-capable model configurations plus two text-only DeepSeek baselines. Accuracy, safe refusal, and visual grounding form genuinely distinct trustworthiness axes that do not collapse to a single leaderboard.

MCS component decomposition and risk-tier silent-failure heatmap
Left: MCS component decomposition (Capability, Safety, Grounding, harmonic-mean MCS). Right: risk-tier silent-failure heatmap.
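
The composite deliberately resists single-leaderboard collapse. A minimal sketch, assuming, per the panel above, that MCS is the harmonic mean of the Capability, Safety, and Grounding scores on a 0–100 scale (the weighting of the seven metrics into the three axes follows the paper):

from statistics import harmonic_mean

def mcs(capability: float, safety: float, grounding: float) -> float:
    """MedVIGIL Composite Score: harmonic mean of the three axes.

    The harmonic mean punishes any single weak axis, so a high-accuracy
    model with poor safe-refusal behaviour cannot buy back its composite.
    """
    return harmonic_mean([capability, safety, grounding])

print(round(mcs(80.0, 55.0, 60.0), 1))  # illustrative values only -> 63.4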

Visual information decay

A continuous Gaussian-blur sweep localises where the model stops using the image and starts answering from language priors. The language-takeover point L⋆ separates the four audited models by a factor of four (16→64 px).

Visual information decay curves
Solid curves: MCQ accuracy as blur σ grows from 0 to 64 px and to no-image (X marker). Dashed curves: text-answerable control. Shaded region right of L⋆ is the language-prior-dominated regime.
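
One simple way to operationalise L⋆ from these curves, assuming it is the smallest blur level at which image-conditioned accuracy becomes indistinguishable from the no-image baseline (a hypothetical reading; the paper's formal definition governs):

def language_takeover_point(sigmas, acc_blurred, acc_no_image, tol=0.02):
    """Estimate L*: the smallest blur sigma (px) at which accuracy on the
    blurred image falls within `tol` of the no-image baseline, i.e. the
    model has stopped extracting usable visual signal."""
    for sigma, acc in zip(sigmas, acc_blurred):
        if acc <= acc_no_image + tol:
            return sigma
    return float("inf")  # visual signal survives the whole sweep

# Illustrative sweep, not reported numbers:
print(language_takeover_point([0, 4, 8, 16, 32, 64],
                              [0.78, 0.74, 0.66, 0.52, 0.41, 0.40],
                              acc_no_image=0.40))  # -> 32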

Visual-token ablation: does the model recognise evidence loss?

For each pilot case we progressively replace the doctor-defined ROI with mid-grey at four steps (ROI intact / 33% / 67% / 100% masked) and track how each flagship model's modal letter changes. A grounded model picks the doctor-defined refusal option (E) more often as the ROI is destroyed; an ungrounded model commits to the same non-refusal letter regardless. The bold serif letters above each marker are the model's modal answer on the example case (MVB-0031): Gemini 3 Flash picks B at every step, even with 100% of the answer-relevant pixels removed. This is the smoking-gun failure mode.
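
The masking operator itself is simple. A minimal sketch with Pillow, assuming axis-aligned doctor-drawn boxes and a left-to-right masking order (the released image variants are pre-rendered; the box coordinates and file path here are hypothetical):

from PIL import Image, ImageDraw

def mask_roi(img: Image.Image, box: tuple[int, int, int, int],
             fraction: float, grey: int = 128) -> Image.Image:
    """Replace `fraction` of the doctor-defined ROI with mid-grey.

    `box` is (left, top, right, bottom); fraction in {0.0, 0.33, 0.67, 1.0}
    masks the ROI progressively from its left edge rightwards."""
    out = img.copy()
    if fraction <= 0:
        return out  # step 0: ROI intact
    left, top, right, bottom = box
    cut = left + int((right - left) * fraction)
    ImageDraw.Draw(out).rectangle((left, top, cut, bottom), fill=(grey,) * 3)
    return out

# Four ablation steps for one case (hypothetical path and box):
img = Image.open("MVB-0031.png").convert("RGB")
steps = [mask_roi(img, (210, 140, 380, 300), f) for f in (0.0, 0.33, 0.67, 1.0)]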

Visual-token ablation: progressive ROI mask vs. answer trajectory
Left: 2×2 thumbnails of one example case at four ROI-mask steps (blue = ROI; red = portion masked). Right top: refusal rate (% picking option E) climbs to 58.8% for GPT-5.5 [Wilson 95% CI 47.9–68.9] and 57.5% for Claude Opus 4.7 [46.6–67.7], but only 42.5% for Gemini 3 Flash [32.2–53.4]. Right bottom: letter-switch rate vs. step 0; shaded bands are Wilson 95% CIs. Pilot n=80 stratified cases (16 per CRT tier), three flagship API models, five self-consistency samples per cell.
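
The shaded bands are Wilson score intervals; for reference, a self-contained implementation (the exact trial counts behind each reported band are not restated here, so the example numbers are illustrative):

import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% at z = 1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - half, centre + half)

# Illustrative: 47 refusals on 80 cases -> approximately (0.478, 0.689)
lo, hi = wilson_ci(47, 80)
print(f"{lo:.3f}, {hi:.3f}")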

What We Release

300-case manifest
VQA-RAD, SLAKE, ROCO, MIMIC-CXR/CheXpert subset — with credentialed CXR shipped as reconstruction pointers.
2,556 MCQ probes
Six clinician-authored probe families plus original controls, ready for exact-match scoring.
240 counterfactual triplets
Anchor / T-CF / V-CF for triplet-coherence (TR-coh) follow-up audits.
ROI annotations
Doctor-drawn bounding boxes used to construct ROI-masked / ROI-only image variants.
Open-ended variant
Same probes in free-form for qualitative analysis.
Croissant 1.0 metadata
17 RAI fields, validated with the mlcroissant reference checker.
Cached model outputs
Per-probe letter trace from 16 vision-capable models + R4 baseline.
Reproducible scoring
Fixed prompt template, decoding settings, deterministic probe-expansion.

All artefacts are hosted on Hugging Face: huggingface.co/datasets/jhq0709/MedVIGIL
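
A minimal loading-and-scoring sketch against the hosted artefacts, assuming a probes split and illustrative column names (probe_id, gold_letter); the dataset card defines the authoritative schema:

from datasets import load_dataset

# Split and column names are illustrative; see the dataset card.
probes = load_dataset("jhq0709/MedVIGIL", split="probes")

def exact_match(predictions: dict[str, str]) -> float:
    """Share of probes whose predicted letter equals the gold letter."""
    hits = sum(
        predictions.get(row["probe_id"], "").strip().upper() == row["gold_letter"]
        for row in probes
    )
    return hits / len(probes)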

BibTeX

@misc{jiang2026medvigil,
  title  = {{MedVIGIL}: Evaluating Trustworthy Medical {VLM}s Under Broken Visual Evidence},
  author = {Jiang, Hanqi and Chen, Junhao and Pan, Yi and Chen, Lifeng and
            You, Weihang and Gong, Haozhen and Yan, Ruiyu and Lv, Jinglei and
            Zhao, Lin and Ren, Hui and Li, Quanzheng and Liu, Tianming and Li, Xiang},
  year   = {2026},
  note   = {Preprint, under review}
}