GPT-5.5 Just Solved Its First Real ProgramBench Task — And Left Claude Opus 4.7 in the Dust

On May 12, 2026, the ProgramBench team dropped a small but very spicy update: they finally ran GPT-5.5 (and Claude Opus 4.7 for good measure) on their toughest settings, high and xhigh (maximum reasoning budget and exploration time).

The result? One clean, unambiguous headline from the authors:

“GPT 5.5 (xhigh) is significantly better than Claude Opus 4.7 (xhigh) across all metrics.”

And they’re not exaggerating.

The First Full Solve Out of 200 Tasks

ProgramBench is one of the hardest coding benchmarks out there: 200 real-world programs (from tiny CLI tools to monsters like SQLite and FFmpeg). The model is given only the compiled binary and documentation: no source code, no internet, no decompilation. It has to rebuild the entire thing from scratch so that it passes hidden behavioral tests.

Until yesterday, zero models had fully solved any task at the highest difficulty.

Now there is one.

Task: cmatrix (a terminal-based “The Matrix” rain effect).
GPT-5.5 solved it twice:
- GPT-5.5 (high) → clean C implementation using raw ANSI escape sequences
- GPT-5.5 (xhigh) → self-contained Python 3 version

Both runs passed 100% of the behavioral tests.
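
For readers who haven't seen the technique, here is a rough sketch of what drawing with raw ANSI escape sequences looks like. This is not GPT-5.5's submission and not the real cmatrix code: the function names `render_frame` and `step`, the five-glyph tail, and the color choices are all made up for illustration. It is in Python for brevity.

```python
import random

CSI = "\x1b["  # Control Sequence Introducer that starts most ANSI escape codes


def render_frame(heads, width, height, rng):
    """Return one frame of "rain" as a single string of ANSI escape sequences.

    heads[x] is the row of the bright leading glyph in column x
    (hypothetical layout, chosen for this sketch only).
    """
    parts = [CSI + "2J"]  # clear the screen
    for x in range(width):
        head = heads[x]
        # Draw the head plus up to four tail glyphs that are on screen.
        for y in range(max(0, head - 4), min(height, head + 1)):
            glyph = chr(rng.randint(33, 126))  # random printable ASCII
            color = "1;37" if y == head else "32"  # bold white head, green tail
            # Cursor positions are 1-based: ESC[<row>;<col>H
            parts.append(f"{CSI}{y + 1};{x + 1}H{CSI}{color}m{glyph}")
    parts.append(CSI + "0m")  # reset colors and attributes
    return "".join(parts)


def step(heads, height, rng):
    """Move every column down one row; restart columns that have left the screen."""
    return [h + 1 if h < height + 4 else rng.randint(-height, 0) for h in heads]
```

Writing `render_frame(...)` to `sys.stdout` in a loop, calling `step` and sleeping briefly between frames, would animate it in any terminal that understands these sequences.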

The Real Gap Shows Up at “Almost Solved”

Even more impressive is the 95%+ pass rate category (i.e. the model got almost everything right, missing only tiny edge cases):

[Figure: 95%+ pass-rate results for GPT-5.5 and Claude Opus 4.7]

That’s not a small lead. That’s GPT-5.5 xhigh doing three times more useful work than the previous frontier model.

The cumulative histogram on ProgramBench tells the story even better: the GPT-5.5 xhigh curve sits noticeably to the right and above everything else. It doesn’t just solve more tasks — it solves more of each task.
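
To make "to the right and above" concrete, here is a small sketch of how such a curve can be computed. It assumes the plot shows, for each threshold, the share of tasks whose test pass rate is at or above that threshold, which is one common reading of a cumulative histogram and may not match ProgramBench's exact definition. The two models and their numbers are invented for illustration and are not benchmark results.

```python
def fraction_at_or_above(pass_rates, threshold):
    """Share of tasks whose test pass rate is at least `threshold`."""
    return sum(r >= threshold for r in pass_rates) / len(pass_rates)


def cumulative_curve(pass_rates, thresholds):
    """Return (threshold, share of tasks at or above it) pairs."""
    return [(t, fraction_at_or_above(pass_rates, t)) for t in thresholds]


# Hypothetical per-task pass rates for two made-up models; NOT ProgramBench data.
model_a = [1.00, 0.97, 0.90, 0.60, 0.30]
model_b = [0.96, 0.80, 0.55, 0.40, 0.10]

thresholds = [0.5, 0.95, 1.0]
for name, rates in [("model_a", model_a), ("model_b", model_b)]:
    print(name, cumulative_curve(rates, thresholds))
```

In this toy data, `model_a` has the higher share at every threshold, so its curve would sit above `model_b`'s across the whole plot. The 95% and 100% thresholds correspond to the "almost solved" and "fully solved" categories discussed above.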

What This Actually Means

We’re still not seeing the models in full agent mode (Codex-style editing loops, Claude Code, or the equivalent of a `/goal` cycle with multiple iterations and self-correction). The authors themselves note that higher reasoning budgets already make a massive difference, and proper agent scaffolding will push the numbers even higher.

Translation: the raw intelligence is already there. The scaffolding just needs to catch up.

The Bigger Picture

By the end of 2026 we’re very likely going to see the first practical coding agents that can take a vague prompt and reliably output tens of thousands of lines of working code — not perfect, but 95% there, which for many real-world use cases is already game-changing.

ProgramBench just gave us the clearest signal yet that the frontier has moved again, and it moved hard.

The age of “AI that can actually ship real software” isn’t coming.
It’s already knocking on the door.

Check the full update here:
GPT 5.5 high Solves First Instance!

And the specific task that just got crushed:
cmatrix on ProgramBench

Buckle up. The code generation winter is officially over.
