The Mirage Effect: Stanford Just Proved That “Computer Vision” Is Often Just Confident Bullshit

A new preprint from Stanford researchers has dropped a quiet bomb on the entire field of multimodal AI. They call it the Mirage Effect — and it’s one of the most uncomfortable findings in recent AI research.

The setup is simple: ask a multimodal model a question about an image, but don’t attach any image at all. Instead of saying “Sorry, no image was provided,” the model confidently starts hallucinating vivid details: “This chest X-ray shows mild pneumothorax in the left lung,” “There are three sparrows on the branch,” or “The license plate is clearly 7H8K-392.”
It doesn’t hedge. It doesn’t admit uncertainty. It just delivers a detailed, authoritative description of something that doesn’t exist.
The Numbers Are Brutal

- On average, these models produce visual mirages more than 60% of the time when no image is actually present.
- With certain prompting styles, the rate jumps to 90–100%.
- Not a single major model reliably says “I don’t see an image.”
Even worse: when the researchers stripped away the actual images from standard visual benchmarks, the models still achieved 70–80% of their original “visual” accuracy. In other words, a huge chunk of what we’ve been calling “computer vision success” was never vision at all — it was just the model exploiting statistical patterns in the questions and training data.
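The measurement behind those numbers can be sketched in a few lines. This is a hypothetical illustration, not the paper’s code: the function names, the refusal patterns, and the toy replies are all assumptions. The idea is to send image questions with no image attached and count how many replies describe content instead of admitting the image is missing.

```python
import re

# Assumed refusal patterns (illustrative only): a reply that matches one of
# these is treated as an honest "no image" response.
REFUSAL_PATTERNS = [
    r"no image",
    r"image (was|is) not (provided|attached)",
    r"cannot see",
    r"don't see an image",
]

def is_mirage(reply: str) -> bool:
    """True if the reply hallucinates content instead of flagging the missing image."""
    text = reply.lower()
    return not any(re.search(p, text) for p in REFUSAL_PATTERNS)

def mirage_rate(replies: list[str]) -> float:
    """Fraction of replies that are visual mirages."""
    return sum(is_mirage(r) for r in replies) / len(replies)

# Toy example: two mirages, one honest refusal.
replies = [
    "This chest X-ray shows mild pneumothorax in the left lung.",
    "There are three sparrows on the branch.",
    "Sorry, no image was provided.",
]
print(mirage_rate(replies))  # 2 of 3 replies hallucinate
```

A real harness would query each model over a full question set and report the aggregate rate per prompting style, which is where the 60% and 90–100% figures come from.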
The Medical Problem Is Especially Scary
The mirages get darker in healthcare. When no image is provided, models don’t just guess randomly — they are heavily biased toward inventing **severe pathologies**. In the paper’s medical examples, the hallucinations disproportionately produce melanoma, carcinoma, fractures, tumors, and other “scary” diagnoses.
If an image fails to load in a real clinical pipeline, the model won’t flag “no data.” It will confidently deliver a terrifying (and completely made-up) diagnosis instead.
The Ultimate Humiliation: A 3B Model Beats the Giants
To prove how broken the current evaluation is, the Stanford team took a relatively small Qwen-2.5 3B model and fine-tuned it on a chest X-ray benchmark — but without ever showing it a single real image. They only let it see the questions and the answer distribution.
Result?
This tiny 3-billion-parameter model outperformed giant frontier models *and* the average human radiologist on the benchmark.
It didn’t learn to read X-rays.
It learned to read the test.
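Why does reading the test work so well? A toy illustration, with made-up numbers and labels that are not from the paper: if a benchmark’s answer distribution is skewed, a model that never sees an image can score well simply by learning the label prior from the training split.

```python
from collections import Counter

# Assumed, illustrative answer distribution (not the actual benchmark's).
train_answers = ["pneumonia"] * 60 + ["normal"] * 25 + ["effusion"] * 15

# "Training" without images reduces to estimating the most common answer.
prior = Counter(train_answers)
blind_guess = prior.most_common(1)[0][0]

# If the test split shares the same skew, blind guessing matches it.
test_answers = ["pneumonia"] * 60 + ["normal"] * 25 + ["effusion"] * 15
accuracy = sum(a == blind_guess for a in test_answers) / len(test_answers)
print(blind_guess, accuracy)  # the majority label already scores 60%
```

A fine-tuned language model can go further than a single majority label by conditioning on the question wording, which is how a 3B text-only model can climb past human-level scores on a “visual” benchmark.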
The Proposed Fix: B-Clean
The authors aren’t just complaining. They introduced a new cleaning method called B-Clean (Benchmark Cleaning). The idea is simple but powerful: go through every visual benchmark and remove any question that a model can answer correctly without actually needing to see the image.
Only after this cleaning should we measure true visual intelligence. Until then, we’re mostly measuring how well models have learned to bullshit convincingly.
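The B-Clean idea can be sketched as a simple filter. Everything below is an assumption for illustration (the function names, the item schema, and the stand-in blind model are not from the paper): run a text-only baseline over each question and drop every item it already answers correctly, since those items never required vision.

```python
# Hypothetical sketch of benchmark cleaning: keep only the items a blind,
# text-only baseline gets WRONG, because those still need the image.
def b_clean(benchmark, blind_answer):
    """Filter out questions answerable without seeing the image."""
    return [item for item in benchmark
            if blind_answer(item["question"]) != item["answer"]]

# Toy benchmark: the first item leaks its answer through the question prior.
benchmark = [
    {"question": "Is the sky in this photo blue or green?", "answer": "blue"},
    {"question": "How many birds are on the branch?", "answer": "3"},
]

# Stand-in for a real text-only model: exploits the obvious prior.
def blind_answer(question):
    return "blue" if "blue or green" in question else "unknown"

cleaned = b_clean(benchmark, blind_answer)
print(len(cleaned))  # only the counting question survives
```

In practice the blind baseline would itself be a strong language model, and an item would likely be dropped only if it is answered correctly consistently, not on a single lucky guess.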

Why This Matters
We’ve spent years celebrating “state-of-the-art” scores on visual reasoning benchmarks. This paper suggests that a large portion of those victories were mirages — impressive-sounding nonsense delivered with total confidence.
The models aren’t just occasionally wrong.
They’re systematically pretending to see things that aren’t there, especially when it matters most (medicine, safety-critical applications).
The preprint is available here: https://arxiv.org/abs/2603.21687
It’s one of those rare papers that doesn’t just point out a flaw — it explains why the entire measurement system has been lying to us. And it gives us a practical way to stop believing the mirage.