Research Note · Preprint under review

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

Medical VLMs are getting better at answering questions about images. But clinical trust requires something harder than fluent answering: a model should recognize when the visual evidence no longer supports an answer. MedVIGIL evaluates whether medical vision-language models can fail safely when the evidence contract is broken.

Evidence intact: The image supports the clinical answer.

Evidence broken: The premise, region, or laterality no longer supports the answer.

Safe response: The model should refuse instead of filling the gap.


300 Clinician-supervised cases

Gold answers, refusal options, ROI boxes, and risk tiers are radiologist-authored.

2,556 MCQ probes

Question perturbations expose whether models rely on evidence or language priors.

+14.1 MCS headroom

Gap in MedVIGIL Composite Score (MCS) between the independent radiologist baseline and the strongest audited model in the report.

About the research

MedVIGIL is part of Hanqi Jiang’s research on trustworthy medical AI. The project is supervised by radiologists and focuses on a practical question for medical AI safety: can a vision-language model detect when its visual evidence is missing, corrupted, misleading, or no longer sufficient?

Core thesis

Medical VLMs should be evaluated not only by whether they answer, but by whether they know when not to answer.

Most benchmarks reward the correct final answer on intact image-question pairs. MedVIGIL audits a different behavior: safe abstention under broken visual evidence.

Clean input: Evidence contract holds

The image and question support a specific clinical answer.

Evidence perturbation: The contract breaks

The ROI is masked, the premise is false, or the wording no longer matches the image.

Model behavior: Silent failure vs. safe refusal

A trustworthy model should refuse when the requested evidence is unavailable.

The problem

The dangerous case is not a wrong answer. It is a confident answer when the evidence is broken.

In medical VQA, a model can appear competent when the image and question are clean. But real clinical workflows contain missing regions, ambiguous prompts, false premises, laterality changes, and image-question mismatches.

The model should not answer from memory when the image evidence is gone.

MedVIGIL turns this intuition into an audit. It asks whether a model can distinguish answerable visual evidence from broken evidence and choose the doctor-defined refusal option when appropriate.
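The distinction between silent failure and safe refusal can be sketched in a few lines. This is a simplified illustration, not the benchmark's scoring code: the names `ProbeResult`, `classify`, and `silent_failure_rate` are hypothetical, and the sketch assumes that every broken-evidence probe expects the refusal option, which may not hold for all perturbation types (for example, knowledge-only rewrites).

```python
from dataclasses import dataclass
from typing import Iterable

REFUSAL = "REFUSE"  # placeholder label for the doctor-defined refusal option


@dataclass(frozen=True)
class ProbeResult:
    evidence_intact: bool  # True if the image still supports the gold answer
    gold: str              # gold option for the clean case
    chosen: str            # option the model selected


def classify(result: ProbeResult) -> str:
    """Map one probe response to an outcome label (assumed taxonomy)."""
    if result.evidence_intact:
        if result.chosen == result.gold:
            return "correct"
        if result.chosen == REFUSAL:
            return "over_refusal"
        return "wrong_answer"
    # Evidence is broken: any non-refusal answer counts as a silent failure.
    return "safe_refusal" if result.chosen == REFUSAL else "silent_failure"


def silent_failure_rate(results: Iterable[ProbeResult]) -> float:
    """Fraction of broken-evidence probes answered instead of refused."""
    broken = [r for r in results if not r.evidence_intact]
    if not broken:
        return 0.0
    failures = sum(classify(r) == "silent_failure" for r in broken)
    return failures / len(broken)


demo = [
    ProbeResult(evidence_intact=True, gold="A", chosen="A"),
    ProbeResult(evidence_intact=True, gold="B", chosen=REFUSAL),
    ProbeResult(evidence_intact=False, gold="A", chosen=REFUSAL),
    ProbeResult(evidence_intact=False, gold="A", chosen="A"),
]
demo_rate = silent_failure_rate(demo)
```

In this toy set, one of the two broken-evidence probes is answered rather than refused, so `demo_rate` is 0.5. Keeping over-refusal as a separate outcome matters because a model that refuses everything would otherwise look perfectly safe.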

Benchmark design

What the benchmark controls

Evidence contracts

Each case defines what visual evidence is needed, which answer is supported, and when a refusal is clinically appropriate.

Controlled perturbations

False-premise traps, wording changes, ROI corruption, knowledge-only rewrites, and laterality flips probe different failure modes.

Clinician baseline

A separate fourth radiologist answers the probes independently, giving a human reference point for model audits.
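One way to picture an evidence contract and its perturbed probes is as a small record type. The sketch below is an assumption about how such data could be organized, not the released dataset schema: every field name, the ROI box convention, and the example values are hypothetical. Whether a given perturbation expects refusal is left to the case author, since the text does not state a fixed rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Perturbation(Enum):
    """Perturbation families named in the text; identifiers are illustrative."""
    FALSE_PREMISE = "false_premise"
    WORDING_CHANGE = "wording_change"
    ROI_CORRUPTION = "roi_corruption"
    KNOWLEDGE_ONLY = "knowledge_only"
    LATERALITY_FLIP = "laterality_flip"


@dataclass(frozen=True)
class EvidenceContract:
    case_id: str
    question: str
    options: Tuple[str, ...]
    gold_answer: str                    # answer supported by the intact image
    refusal_option: str                 # clinician-defined refusal choice
    roi_box: Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max)
    risk_tier: str

    def __post_init__(self) -> None:
        # Both the gold answer and the refusal must be selectable options.
        for name in ("gold_answer", "refusal_option"):
            if getattr(self, name) not in self.options:
                raise ValueError(f"{name} must be one of the options")


@dataclass(frozen=True)
class Probe:
    contract: EvidenceContract
    perturbation: Optional[Perturbation]  # None means the clean input
    refusal_expected: bool                # set by the case author

    @property
    def expected_option(self) -> str:
        c = self.contract
        return c.refusal_option if self.refusal_expected else c.gold_answer


# Hypothetical example case, for illustration only.
contract = EvidenceContract(
    case_id="case-001",
    question="Is there a pleural effusion on the left?",
    options=("Yes", "No", "Cannot be determined from this image"),
    gold_answer="Yes",
    refusal_option="Cannot be determined from this image",
    roi_box=(120, 340, 260, 480),
    risk_tier="L5",
)
clean = Probe(contract, None, refusal_expected=False)
masked = Probe(contract, Perturbation.ROI_CORRUPTION, refusal_expected=True)
```

Storing the refusal option on the contract, rather than as a global constant, reflects the point that refusal wording is defined per case by clinicians.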

Evidence

Audit results across frontier medical and general VLMs

The independent radiologist reaches an MCS of 83.3 with a 5.8% silent-failure rate. The strongest audited model reported on the project page reaches an MCS of 69.2, leaving 14.1 points of composite headroom.

Independent radiologist: 83.3

Human reference MCS with a 5.8% silent-failure rate.

Strongest audited model: 69.2

Best reported model MCS on the project page.

Composite headroom: +14.1

The remaining gap between frontier VLM behavior and the independent radiologist baseline.
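The headroom figure is simply the difference between the two reported composite scores. A minimal check, using `Decimal` to avoid binary floating-point noise (the function name `composite_headroom` is illustrative):

```python
from decimal import Decimal


def composite_headroom(human_mcs: str, model_mcs: str) -> Decimal:
    """Difference between the human reference MCS and a model's MCS."""
    return Decimal(human_mcs) - Decimal(model_mcs)


# Scores as reported above: independent radiologist vs. strongest audited model.
headroom = composite_headroom("83.3", "69.2")
print(f"+{headroom}")  # prints "+14.1"
```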

Why this matters: capability, safety, grounding, and risk-tier silent failures do not collapse to one accuracy leaderboard.

83.3

Independent radiologist MedVIGIL Composite Score.

68.9%

Reported silent-failure rate for GPT-4o on L5 don’t-miss traps.

240

Counterfactual triplets for follow-up coherence audits.

Grounding stress test

What happens when answer-relevant pixels disappear?

MedVIGIL includes visual-token ablation: progressively mask the doctor-defined ROI and track whether the model changes its answer or selects the refusal option. A grounded model should become less willing to answer as the relevant pixels are removed.

Full ROI: Evidence visible

The model has access to the answer-relevant region.

Progressive mask: Evidence removed

The doctor-defined ROI is gradually replaced with a neutral mask.

Expected behavior: Refusal should rise

A grounded model should become less willing to answer as evidence disappears.

Failure mode: Answer stays fixed

An ungrounded model keeps choosing the same non-refusal answer despite evidence loss.
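The ablation loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the benchmark's implementation: images are plain nested lists, the ROI is masked row by row from the top (the actual masking schedule is not specified here), the fill value `NEUTRAL` is arbitrary, and the two "models" are stand-in functions rather than real VLM calls.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive

NEUTRAL = 0.5       # assumed neutral fill value
REFUSAL = "REFUSE"  # placeholder for the refusal option


def mask_roi(image: Image, roi: Box, fraction: float) -> Image:
    """Return a copy with the top `fraction` of ROI rows set to NEUTRAL."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    top, left, bottom, right = roi
    rows_to_mask = round((bottom - top) * fraction)
    masked = [row[:] for row in image]  # leave the input untouched
    for r in range(top, top + rows_to_mask):
        for c in range(left, right):
            masked[r][c] = NEUTRAL
    return masked


def ablation_curve(
    answer_fn: Callable[[Image], str],
    image: Image,
    roi: Box,
    fractions: Sequence[float],
) -> List[Tuple[float, str]]:
    """Record the model's answer at each masking level."""
    return [(f, answer_fn(mask_roi(image, roi, f))) for f in fractions]


def answer_stays_fixed(curve: Sequence[Tuple[float, str]]) -> bool:
    """True if the same non-refusal answer is given at every level."""
    answers = {answer for _, answer in curve}
    return len(answers) == 1 and REFUSAL not in answers


# Toy setup: a 6x6 image whose 4x4 ROI carries the signal.
ROI: Box = (1, 1, 5, 5)
image = [[1.0 if 1 <= r < 5 and 1 <= c < 5 else 0.0 for c in range(6)]
         for r in range(6)]


def toy_ungrounded(img: Image) -> str:
    return "A"  # ignores the image entirely


def toy_grounded(img: Image) -> str:
    top, left, bottom, right = ROI
    masked = sum(img[r][c] == NEUTRAL
                 for r in range(top, bottom) for c in range(left, right))
    area = (bottom - top) * (right - left)
    return REFUSAL if masked * 2 > area else "A"  # refuse once most is gone


levels = [0.0, 0.5, 1.0]
grounded_curve = ablation_curve(toy_grounded, image, ROI, levels)
ungrounded_curve = ablation_curve(toy_ungrounded, image, ROI, levels)
```

Here `answer_stays_fixed(ungrounded_curve)` is true, matching the failure mode above, while the grounded toy switches to the refusal option once the ROI is fully masked.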

Research implications

Why this matters for trustworthy AI systems

MedVIGIL is not just a medical benchmark. It is a research argument: high-stakes AI systems need explicit mechanisms for evidence awareness, refusal, and uncertainty, not only better answer generation.

For medical AI, this means evaluation should test whether a model preserves the boundary between what is visually supported, what is answerable from clinical knowledge alone, and what should be deferred to a human expert.

For healthcare specifically, MedVIGIL is an evaluation suite rather than a clinical decision-support tool. Its value is in exposing silent failures before deployment, so model builders can audit whether their systems are truly grounded.

Design principles

What MedVIGIL pushes us to build

01 Evidence-aware generation: Separate what the model knows from what the image actually supports.

02 Refusal as a first-class behavior: Treat safe abstention as a measurable capability, not a fallback error state.

03 Clinician-supervised evaluation: Use expert-authored perturbations and risk tiers for high-stakes model audits.
