GPT-5.5 Just Solved Its First Real ProgramBench Task — And Left Claude Opus 4.7 in the Dust

On May 12, 2026, the ProgramBench team dropped a small but very spicy update: they finally ran GPT-5.5 (and Claude Opus 4.7 for good measure) on their toughest settings, namely high and xhigh (maximum reasoning budget and exploration time).
The result? One clean, unambiguous headline from the authors:
“GPT 5.5 (xhigh) is significantly better than Claude Opus 4.7 (xhigh) across all metrics.”
And they’re not exaggerating.
The First Full Solve Out of 200 Tasks
ProgramBench is one of the hardest coding benchmarks out there: 200 real-world programs (from tiny CLI tools to monsters like SQLite and FFmpeg). You’re given only the compiled binary and documentation — no source code, no internet, no decompilation. The model has to rebuild the entire thing from scratch so that it passes hidden behavioral tests.
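
To make "hidden behavioral tests" concrete, here is a minimal Python sketch of black-box comparison: run the reference binary and the rebuilt program on the same inputs and compare what comes out. This is not ProgramBench's actual harness, which the update does not describe. The names `Observation`, `observe`, and `pass_rate`, and the choice to compare only exit code and stdout, are assumptions made for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What a black-box test can see from one run of a program."""
    exit_code: int
    stdout: bytes


def observe(command: list[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command and record only its externally visible behavior."""
    result = subprocess.run(
        command,
        input=stdin,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
        check=False,  # a non-zero exit code is data here, not an error
    )
    return Observation(result.returncode, result.stdout)


def pass_rate(
    reference: list[str],
    candidate: list[str],
    cases: list[tuple[list[str], bytes]],
) -> float:
    """Fraction of (args, stdin) cases where candidate matches reference."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(
        observe(reference + args, stdin) == observe(candidate + args, stdin)
        for args, stdin in cases
    )
    return passed / len(cases)
```

Under this sketch, a "full solve" would be a `pass_rate` of exactly 1.0, and the candidate's source language would not matter, since only observable behavior is compared.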

Until yesterday, zero models had fully solved any task at the highest difficulty.
Now there is one.
Task: cmatrix (a terminal-based “The Matrix” rain effect).
GPT-5.5 solved it twice:
- GPT-5.5 (high) → clean C implementation using raw ANSI escape sequences
- GPT-5.5 (xhigh) → self-contained Python 3 version

Both runs passed 100% of the behavioral tests.
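
For readers wondering what "raw ANSI escape sequences" looks like in practice, here is a rough Python sketch of the idea: position the cursor with an escape code, print a glyph, repeat. It is not the code either GPT-5.5 run produced and it does not reproduce cmatrix's real behavior or options. The helper names (`move_to`, `render_frame`, `advance`) and the glyph set are invented for illustration.

```python
import random

CSI = "\x1b["  # Control Sequence Introducer, the prefix of ANSI escape codes


def move_to(row: int, col: int) -> str:
    """ANSI sequence that moves the cursor to a 1-based (row, col)."""
    return f"{CSI}{row};{col}H"


def render_frame(heads: list[int], height: int, rng: random.Random) -> str:
    """Draw one green glyph at the head of each column that is on screen."""
    glyphs = "0123456789abcdef"
    parts = [f"{CSI}32m"]  # select green foreground
    for col, head in enumerate(heads, start=1):
        if 0 <= head < height:  # heads use 0-based rows; skip off-screen ones
            parts.append(move_to(head + 1, col) + rng.choice(glyphs))
    parts.append(f"{CSI}0m")  # reset colors and attributes
    return "".join(parts)


def advance(heads: list[int], height: int, rng: random.Random) -> list[int]:
    """Move each head down one row; restart finished columns at or above the top."""
    return [h + 1 if h < height else -rng.randrange(height) for h in heads]
```

Writing `render_frame(...)` to the terminal in a loop, calling `advance` between frames, gives a crude falling-character effect with no curses library involved.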
The Real Gap Shows Up at “Almost Solved”
Even more impressive is the 95%+ pass rate category (i.e. the model got almost everything right, missing only tiny edge cases):

That’s not a small lead. That’s GPT-5.5 xhigh doing three times more useful work than the previous frontier model.
The cumulative histogram on ProgramBench tells the story even better: the GPT-5.5 xhigh curve sits noticeably to the right of and above everything else. It doesn’t just solve more tasks; it solves more of each task.
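
If the "solved", "almost solved", and cumulative-curve framing feels abstract, the following Python sketch shows one way such numbers could be derived from per-task pass rates. It assumes the curve plots the fraction of tasks whose pass rate is at or above each threshold, which may differ from how ProgramBench draws its chart. The function names are hypothetical, and the sample rates in the usage note are made up, not benchmark results.

```python
def tasks_at_or_above(pass_rates: list[float], threshold: float) -> int:
    """Count tasks whose test pass rate is at least `threshold`."""
    return sum(rate >= threshold for rate in pass_rates)


def cumulative_curve(
    pass_rates: list[float], thresholds: list[float]
) -> list[tuple[float, float]]:
    """For each threshold, the fraction of tasks at or above it."""
    if not pass_rates:
        raise ValueError("need at least one task")
    total = len(pass_rates)
    return [(t, tasks_at_or_above(pass_rates, t) / total) for t in thresholds]
```

With made-up rates such as `[1.0, 0.97, 0.5, 0.2]`, `tasks_at_or_above(rates, 1.0)` counts full solves and `tasks_at_or_above(rates, 0.95)` counts the "almost solved" bucket. A model whose curve is higher at every threshold is getting further on more tasks.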
What This Actually Means
We’re still not seeing the models in full agent mode (Codex-style editing loops, Claude Code, or the equivalent of a `/goal` cycle with multiple iterations and self-correction). The authors themselves note that higher reasoning budgets already make a massive difference, and proper agent scaffolding will push the numbers even higher.
Translation: the raw intelligence is already there. The scaffolding just needs to catch up.
The Bigger Picture
By the end of 2026 we’re very likely going to see the first practical coding agents that can take a vague prompt and reliably output tens of thousands of lines of working code — not perfect, but 95% there, which for many real-world use cases is already game-changing.
ProgramBench just gave us the clearest signal yet that the frontier has moved again, and it moved hard.
The age of “AI that can actually ship real software” isn’t coming.
It’s already knocking on the door.
Check the full update here:
GPT 5.5 high Solves First Instance!
And the specific task that just got crushed:
cmatrix on ProgramBench
Buckle up. The code generation winter is officially over.