Preprint · Under review

MedVIGIL: Evaluating Trustworthy Medical VLMs
Under Broken Visual Evidence

A clinician-supervised benchmark that measures whether a medical vision–language model recognises when the visual evidence contract has failed — and refuses safely instead of fabricating a fluent unsupported answer.

Hanqi Jiang1,2, Junhao Chen1, Yi Pan1,2, Lifeng Chen1, Weihang You1, Haozhen Gong3, Ruiyu Yan4, Jinglei Lv5, Lin Zhao6, Hui Ren2, Quanzheng Li2, Tianming Liu1, Xiang Li2
1University of Georgia   2Harvard Medical School   3Nanyang Technological University
4New York University   5University of Sydney   6New Jersey Institute of Technology
✉ corresponding author: Xiang Li

The Failure Mode We Measure

Medical VLMs are usually evaluated on intact image–question pairs. Trustworthy clinical use requires a stronger property: the model must recognise when the evidential basis has broken — a missing region of interest, a false-premise question, or a laterality flip — and refuse instead of fabricating a fluent answer.

Two contrasting model behaviours on a perturbed chest X-ray.
A vision-required chest X-ray paired with two evidence perturbations (ROI mask, laterality flip) yields contrasting behaviours: a silent failure that fabricates a left-apical-pneumothorax measurement versus a safe refusal that recognises the broken evidence. MedVIGIL measures this gap.

Abstract

Medical vision–language models (VLMs) are usually evaluated on intact image–question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer.

We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources in which every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is authored by board-certified radiologists. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a fourth radiologist, independent of construction, answers every probe to provide the human reference baseline. The release contains 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at a silent-failure rate of 5.8%, leaving 14.1 points of composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2).

Key Findings

83.3
Independent radiologist MCS
Board-certified attending radiologist R4, recruited from a separate institution and blind to construction.
+14.1
Composite headroom
R4 sits above every audited model on every reported axis except language-prior accuracy (LPA).
+44.8 pp
Largest model–human gap
ROI-masked safe refusal: R4 selects option E on 86.5% of cases; the strongest model on 41.7%.
68.9%
L5 trap silent-failure
GPT-4o's silent-failure rate on L5 ("don't-miss") trap probes; this harm-weighted gap drives the MCS rank inversion.

Benchmark Architecture

MedVIGIL benchmark architecture overview
(A) doctor-authored evidence contract; (B) text-side and image-side contract-perturbation operators; (C) the response manifold scored by seven reported metrics aggregated into MCS.

Four-radiologist construction-and-evaluation pipeline

Every gold answer, refusal option, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Three radiologists construct the dataset; a separate fourth radiologist provides the human reference baseline.

R1 — attending radiologist · parallel annotation
R2 — attending radiologist · parallel annotation
R3 — senior consolidating radiologist · adjudication
R4 — independent fourth radiologist · construction-blind baseline

Six probe families × five clinical risk tiers

Original
Clean image–question pair as a capability control.
Hallucination trap
False-premise rewrite whose only correct option is the doctor-defined refusal letter.
Paraphrase / T-CF
Wording change that preserves the gold; tests paraphrase-robustness as accuracy, not consistency.
Negation
Inverts the question; gold flips deterministically.
Specificity-drop
Removes a clinical qualifier; gold is preserved.
Knowledge-only
Image-independent rewrite; bounds the language-prior contribution.
ROI-only / ROI-masked
Two image-side variants whose contrast (VGR) measures visual grounding.
Laterality flip
Mirror image; gold flips only on laterality-dependent cases.
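
The probe families above expand deterministically from each clinician-authored manifest case. A minimal sketch of what that expansion could look like, assuming illustrative field names (trap_question, refusal_letter, negated_gold, and so on) rather than the released schema:

def expand_case(case: dict) -> list[dict]:
    """Deterministically expand one manifest case into its MCQ probes.

    Field names are illustrative; the released manifest defines the schema.
    """
    probes = [dict(case, probe="original")]  # clean capability control

    # False-premise trap: the doctor-defined refusal letter is the only gold.
    probes.append(dict(case, probe="hallucination_trap",
                       question=case["trap_question"],
                       gold=case["refusal_letter"]))

    # Wording perturbations: paraphrase and specificity-drop preserve the
    # gold answer; negation flips it deterministically.
    probes.append(dict(case, probe="paraphrase", question=case["paraphrase"]))
    probes.append(dict(case, probe="specificity_drop",
                       question=case["underspecified_question"]))
    probes.append(dict(case, probe="negation",
                       question=case["negated_question"],
                       gold=case["negated_gold"]))

    # Knowledge-only rewrite: image-independent, bounds the language prior.
    probes.append(dict(case, probe="knowledge_only",
                       question=case["knowledge_only_question"], image=None))

    # Image-side operators reuse the original question; the gold flips on
    # laterality flip only when the case is laterality-dependent.
    for variant in ("roi_only", "roi_masked", "laterality_flip"):
        flips = variant == "laterality_flip" and case["laterality_dependent"]
        probes.append(dict(case, probe=variant,
                           image=f"{case['case_id']}_{variant}.png",
                           gold=case["flipped_gold"] if flips else case["gold"]))
    return probes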

Audit Across 16 Frontier Medical & General VLMs

We audit 16 vision-capable model configurations plus two text-only DeepSeek baselines. Accuracy, safe refusal, and visual grounding form genuinely distinct trustworthiness axes that do not collapse to a single leaderboard.

MCS component decomposition and risk-tier silent-failure heatmap
Left: MCS component decomposition (Capability, Safety, Grounding, harmonic-mean MCS). Right: risk-tier silent-failure heatmap.
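
The composite deliberately resists single-leaderboard collapse. A minimal sketch, assuming, per the panel above, that MCS is the harmonic mean of the Capability, Safety, and Grounding scores on a 0–100 scale (the weighting of the seven metrics into the three axes follows the paper):

from statistics import harmonic_mean

def mcs(capability: float, safety: float, grounding: float) -> float:
    """MedVIGIL Composite Score: harmonic mean of the three axes.

    The harmonic mean punishes any single weak axis, so a high-accuracy
    model with poor safe-refusal behaviour cannot buy back its composite.
    """
    return harmonic_mean([capability, safety, grounding])

print(round(mcs(80.0, 55.0, 60.0), 1))  # illustrative values only -> 63.4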

Visual information decay

A continuous Gaussian-blur sweep localises where the model stops using the image and starts answering from language priors. The language-takeover point L⋆ separates the four audited models by a factor of four (16→64 px).

Visual information decay curves
Solid curves: MCQ accuracy as blur σ grows from 0 to 64 px and to no-image (X marker). Dashed curves: text-answerable control. Shaded region right of L⋆ is the language-prior-dominated regime.
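
One simple way to operationalise L⋆ from these curves, assuming it is the smallest blur level at which image-conditioned accuracy becomes indistinguishable from the no-image baseline (a hypothetical reading; the paper's formal definition governs):

def language_takeover_point(sigmas, acc_blurred, acc_no_image, tol=0.02):
    """Estimate L*: the smallest blur sigma (px) at which accuracy on the
    blurred image falls within `tol` of the no-image baseline, i.e. the
    model has stopped extracting usable visual signal."""
    for sigma, acc in zip(sigmas, acc_blurred):
        if acc <= acc_no_image + tol:
            return sigma
    return float("inf")  # visual signal survives the whole sweep

# Illustrative sweep, not reported numbers:
print(language_takeover_point([0, 4, 8, 16, 32, 64],
                              [0.78, 0.74, 0.66, 0.52, 0.41, 0.40],
                              acc_no_image=0.40))  # -> 32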

Visual-token ablation: does the model recognise evidence loss?

For each pilot case we progressively replace the doctor-defined ROI with mid-grey at four steps (ROI intact / 33% / 67% / 100% masked) and track how each flagship model's modal letter changes. A grounded model picks the doctor-defined refusal option (E) more often as the ROI is destroyed; an ungrounded model commits to the same non-refusal letter regardless. The bold serif letters above each marker are the model's modal answer on the example case (MVB-0031): Gemini 3 Flash picks B at every step, even with 100% of the answer-relevant pixels removed. This is the smoking-gun failure mode.
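
The masking operator itself is simple. A minimal sketch with Pillow, assuming axis-aligned doctor-drawn boxes and a left-to-right masking order (the released image variants are pre-rendered; the box coordinates and file path here are hypothetical):

from PIL import Image, ImageDraw

def mask_roi(img: Image.Image, box: tuple[int, int, int, int],
             fraction: float, grey: int = 128) -> Image.Image:
    """Replace `fraction` of the doctor-defined ROI with mid-grey.

    `box` is (left, top, right, bottom); fraction in {0.0, 0.33, 0.67, 1.0}
    masks the ROI progressively from its left edge rightwards."""
    out = img.copy()
    if fraction <= 0:
        return out  # step 0: ROI intact
    left, top, right, bottom = box
    cut = left + int((right - left) * fraction)
    ImageDraw.Draw(out).rectangle((left, top, cut, bottom), fill=(grey,) * 3)
    return out

# Four ablation steps for one case (hypothetical path and box):
img = Image.open("MVB-0031.png").convert("RGB")
steps = [mask_roi(img, (210, 140, 380, 300), f) for f in (0.0, 0.33, 0.67, 1.0)]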

Visual-token ablation: progressive ROI mask vs. answer trajectory
Left: 2×2 thumbnails of one example case at four ROI-mask steps (blue = ROI; red = portion masked). Right top: refusal rate (% picking option E) climbs to 58.8% for GPT-5.5 [Wilson 95% CI 47.9–68.9] and 57.5% for Claude Opus 4.7 [46.6–67.7], but only 42.5% for Gemini 3 Flash [32.2–53.4]. Right bottom: letter-switch rate vs. step 0; shaded bands are Wilson 95% CIs. Pilot n=80 stratified cases (16 per CRT tier), three flagship API models, five self-consistency samples per cell.
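
The shaded bands are Wilson score intervals; for reference, a self-contained implementation (the exact trial counts behind each reported band are not restated here, so the example numbers are illustrative):

import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% at z = 1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - half, centre + half)

# Illustrative: 47 refusals on 80 cases -> approximately (0.478, 0.689)
lo, hi = wilson_ci(47, 80)
print(f"{lo:.3f}, {hi:.3f}")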

What We Release

300-case manifest
VQA-RAD, SLAKE, ROCO, MIMIC-CXR/CheXpert subset — with credentialed CXR shipped as reconstruction pointers.
2,556 MCQ probes
Six clinician-authored probe families plus original controls, ready for exact-match scoring.
240 counterfactual triplets
Anchor / T-CF / V-CF for triplet-coherence (TR-coh) follow-up audits.
ROI annotations
Doctor-drawn bounding boxes used to construct ROI-masked / ROI-only image variants.
Open-ended variant
Same probes in free-form for qualitative analysis.
Croissant 1.0 metadata
17 RAI fields, validated with the mlcroissant reference checker.
Cached model outputs
Per-probe letter trace from 16 vision-capable models + R4 baseline.
Reproducible scoring
Fixed prompt template, decoding settings, deterministic probe-expansion.

All artefacts are hosted on Hugging Face: huggingface.co/datasets/jhq0709/MedVIGIL
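
A minimal loading-and-scoring sketch against the hosted artefacts, assuming a probes split and illustrative column names (probe_id, gold_letter); the dataset card defines the authoritative schema:

from datasets import load_dataset

# Split and column names are illustrative; see the dataset card.
probes = load_dataset("jhq0709/MedVIGIL", split="probes")

def exact_match(predictions: dict[str, str]) -> float:
    """Share of probes whose predicted letter equals the gold letter."""
    hits = sum(
        predictions.get(row["probe_id"], "").strip().upper() == row["gold_letter"]
        for row in probes
    )
    return hits / len(probes)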

BibTeX

@misc{jiang2026medvigil,
  title  = {{MedVIGIL}: Evaluating Trustworthy Medical {VLM}s Under Broken Visual Evidence},
  author = {Jiang, Hanqi and Chen, Junhao and Pan, Yi and Chen, Lifeng and
            You, Weihang and Gong, Haozhen and Yan, Ruiyu and Lv, Jinglei and
            Zhao, Lin and Ren, Hui and Li, Quanzheng and Liu, Tianming and Li, Xiang},
  year   = {2026},
  note   = {Preprint, under review}
}