In an unusual experiment designed to test real-time reasoning, nine large language models spent five days playing no-limit Texas hold ’em against each other in a fully automated environment.
The competitors included OpenAI’s o3, Claude Sonnet 4.5 from Anthropic, Grok from xAI, Gemini 2.5 Pro from Google, Llama 4 from Meta, DeepSeek R1, Kimi K2 from Moonshot AI, Magistral from Mistral AI, and GLM 4.6 from Z.AI.
Each model started with a $100,000 bankroll and played thousands of hands at $10/$20 tables. The tournament was run by PokerBattle.ai, with the same initial prompt and rules applied to every participant.
At the end of the week, OpenAI’s o3 model finished with the highest profit, up $36,691.
How the models performed
OpenAI’s o3 showed the most consistent performance across the tournament. It won three of the five largest pots and stayed close to established pre-flop strategy, avoiding large losses while steadily accumulating chips.
Claude Sonnet 4.5 from Anthropic finished second with $33,641 in profit. xAI’s Grok placed third, ending the tournament up $28,796.
Google’s Gemini 2.5 Pro recorded a modest profit, while several other models struggled to maintain their stacks. Meta’s Llama 4 lost its entire bankroll early in the event. Moonshot AI’s Kimi K2 also performed poorly, finishing down more than $13,000.
The remaining models landed between these extremes, neither collapsing nor standing out.
What poker reveals about AI decision-making
Poker is often used as a benchmark for general reasoning systems because it combines incomplete information, probability, and opponent modeling. Unlike perfect-information games such as chess, poker rewards managing uncertainty and adapting to opponents' changing behavior.
Across the tournament, most models showed a tendency toward aggressive play. They favored action over caution, often pursuing large pots rather than minimizing losses. Bluffing attempts were frequent but inconsistent, usually driven by incorrect hand evaluations rather than deliberate deception.
Despite these weaknesses, the top-performing models demonstrated an ability to adjust strategies over time and respond to opponent patterns. Their decisions reflected probabilistic reasoning rather than simple rule following.
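To make that kind of probabilistic reasoning concrete, the sketch below shows one of the simplest calculations a poker-playing agent can apply: comparing estimated hand equity against pot odds to decide whether a call is profitable. The function names and example numbers are illustrative only and are not drawn from the PokerBattle.ai setup.

```python
# Minimal sketch of pot-odds reasoning, the kind of expected-value check a
# poker agent might apply when facing a bet. Names and numbers are illustrative.

def call_ev(pot: float, to_call: float, equity: float) -> float:
    """Expected value of calling: win the current pot with probability
    `equity`, lose the call amount otherwise."""
    return equity * pot - (1 - equity) * to_call

def should_call(pot: float, to_call: float, equity: float) -> bool:
    """Call when estimated equity exceeds the pot odds (the price of the call)."""
    pot_odds = to_call / (pot + to_call)
    return equity > pot_odds

# Example: $300 in the pot, opponent bets $100 (pot is now $400, $100 to call),
# and we estimate roughly 30% equity on a draw.
pot, bet, equity = 400, 100, 0.30
print(should_call(pot, bet, equity))        # True: 30% > 100/500 = 20%
print(round(call_ev(pot, bet, equity), 2))  # 50.0, a profitable call on average
```

Repeating that comparison across many hands, rather than following fixed rules about which cards to play, is roughly what distinguishes the stronger performers described above.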
Limits still visible
The experiment also highlighted persistent shortcomings. Several models misjudged position, overestimated hand strength, or failed to disengage from losing scenarios. These errors mirror broader challenges seen in real-world AI deployments, where systems can draw confident conclusions from incomplete or misinterpreted inputs.
While the tournament does not suggest AI systems understand poker in a human sense, it does show progress in managing ambiguity and making sequential decisions under pressure.