In an audacious experiment, the team at Every has pitted top AI models against one another in a high-stakes game of Diplomacy, the classic strategy game in which the European Great Powers, Russia among them, vie for control of supply centers through negotiation and tactical maneuvering.
Launched via their project page, this initiative tests the strategic prowess of AI assistants in a way traditional benchmarks can’t.
After 15 matches, each lasting from one hour to more than a day, only o3 (ChatGPT) and Gemini 2.5 Pro emerged as consistent winners, revealing fascinating insights into the models' capabilities.
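Every hasn't published its orchestration code in this piece, but the shape of such a harness is easy to sketch. The Python below is a hypothetical illustration only: the `Power` class, the `negotiation_round` and `collect_orders` functions, and the prompt wording are all invented for this article, and each model is stubbed as a plain function so the turn loop stands on its own. A real run would swap the stub for calls to each provider's API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Any callable mapping a prompt string to a reply string can play.
# In the real experiment each power would be backed by a hosted LLM;
# this stub keeps the orchestration logic self-contained.
ModelFn = Callable[[str], str]

@dataclass
class Power:
    name: str                                      # e.g. "Germany"
    model: ModelFn                                 # the LLM playing this power
    inbox: list[str] = field(default_factory=list)

def negotiation_round(powers: list[Power], board: str) -> None:
    """One round of private messages: every power writes to every other."""
    for sender in powers:
        for receiver in powers:
            if receiver is sender:
                continue
            prompt = (
                f"You are {sender.name} in a game of Diplomacy.\n"
                f"Board state: {board}\n"
                f"Write a short private message to {receiver.name}."
            )
            receiver.inbox.append(f"From {sender.name}: {sender.model(prompt)}")

def collect_orders(powers: list[Power], board: str) -> dict[str, str]:
    """After talks close, each power commits orders simultaneously."""
    orders: dict[str, str] = {}
    for power in powers:
        prompt = (
            f"You are {power.name}. Board state: {board}\n"
            "Messages received this turn:\n" + "\n".join(power.inbox) +
            "\nSubmit your movement orders."
        )
        orders[power.name] = power.model(prompt)
        power.inbox.clear()                        # messages are per-turn
    return orders

if __name__ == "__main__":
    def stub(prompt: str) -> str:                  # stand-in for a real LLM call
        return "(model reply here)"

    powers = [Power("England", stub), Power("Germany", stub)]
    negotiation_round(powers, board="Spring 1901, opening positions")
    print(collect_orders(powers, board="Spring 1901, opening positions"))
```

The interesting dynamics in Every's games (deception, shifting alliances) emerge from what the models write into each other's inboxes across many such turns.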
Key Takeaways from the Games
1. o3 (ChatGPT): Master of Deception
o3 dominated with cunning tactics, excelling at deception and betrayal. Observers noted its intricate schemes, including a standout moment when its internal log revealed, "Germany (Gemini 2.5 Pro) was deliberately misled… preparing to exploit Germany's collapse," before it delivered a decisive strike. That strategic duplicity underpinned its frequent victories.
2. Gemini 2.5 Pro: Unpredictable Maverick
Gemini 2.5 Pro stood out with its ability to execute unexpected moves, catching opponents off guard. Its adaptability and surprise tactics made it a formidable rival, often turning the tide in its favor.
3. Claude: The Peacemaker’s Pitfall
Claude consistently sought peaceful resolutions, a noble but flawed approach in a game where only one player can win. Its diplomatic efforts were frequently undermined by o3, which cleverly turned Claude's alliances against the other players.
4. DeepSeek: The Intimidator
DeepSeek adopted an aggressive stance, issuing threats like "Your fleet in the Black Sea will be burned tonight," and tailoring its persona to the country it represented. Intimidating as the theatrics were, they couldn't secure wins against the top two.
5. Llama 4 Maverick: Lightweight Contender
For a lighter-weight model, Llama 4 Maverick performed admirably, forging convincing alliances and pulling off the occasional deception. It still couldn't outmatch o3 or Gemini 2.5 Pro, which claimed victory time and again.
Why This Matters
This experiment, still in progress as of June 7, 2025, aims to evaluate AI effectiveness beyond standard benchmarks.
By simulating complex social and strategic interactions, it tests reasoning, negotiation, and adaptability, skills critical for real-world AI applications.
The results highlight o3’s manipulative edge and Gemini 2.5 Pro’s tactical flair, while exposing the limitations of more rigid or pacifist models.
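Aggregating such open-ended games into a headline result can be as simple as a per-model win-rate tally. The sketch below uses invented placeholder outcomes, not Every's actual 15-game scoreboard, purely to show the bookkeeping.

```python
from collections import Counter

def win_rates(winners: list[str]) -> dict[str, float]:
    """Share of games won per model, given one winner label per match."""
    counts = Counter(winners)
    total = len(winners)
    return {model: wins / total for model, wins in counts.items()}

# Placeholder outcomes for illustration only; the real results live on
# Every's project page and Twitch archives.
sample = ["o3", "o3", "Gemini 2.5 Pro", "o3", "Gemini 2.5 Pro"]
print(win_rates(sample))   # {'o3': 0.6, 'Gemini 2.5 Pro': 0.4}
```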
Watch It Live
Fans and analysts can witness this AI battle in real time on Twitch, where matches unfold with live commentary. This project not only entertains but also pushes the boundaries of AI development, offering a glimpse into how these models might evolve in competitive, human-like scenarios. As the games continue, the insights gained could reshape our understanding of AI's strategic potential.