MTG Bench: Testing how well LLMs can play Magic

mtgautodeck.com

June 11, 2026

5 min read

Summary

MTG Bench tests whether LLMs can play Magic: The Gathering without a rules engine, on the view that a high-performing LLM should not require one. In the examples shown, Fable 5 successfully plays a scry land and examines the top card of the deck, while Gemini 3.5 performs a complex turn involving scry, discover, and tutor effects.

Key Takeaways

The MTG Bench tests the performance of large language models (LLMs) in playing Magic: The Gathering, highlighting both successful and failed simulations.
LLMs were better at judging whether a simulated turn was legal than at simulating a legal turn themselves.
Using a remote MCP server for LLM API calls saves money by reducing how often cached input tokens are charged.
The benchmark penalizes models that excessively call tools, as mistakes in card drawing cannot be easily undone in the context of Magic: The Gathering gameplay.
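
To make the last takeaway concrete, here is a minimal sketch of how a tool-call penalty could be combined with an irreversible draw. It is not MTG Bench's actual scoring: the names `Library`, `penalized_score`, `free_calls`, and `penalty_per_call`, along with the numeric defaults, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Library:
    """Hypothetical deck model: a draw reveals a card and cannot be undone."""
    cards: list
    drawn: list = field(default_factory=list)

    def draw(self) -> str:
        # Removing the top card is permanent; there is no undo operation,
        # which is why a mistaken draw tool call is costly.
        card = self.cards.pop(0)
        self.drawn.append(card)
        return card


def penalized_score(base_score: float, tool_calls: int,
                    free_calls: int = 10,
                    penalty_per_call: float = 2.0) -> float:
    """Assumed scheme: subtract a fixed penalty per tool call over an allowance."""
    excess = max(0, tool_calls - free_calls)
    # Clamp at zero so a very chatty run cannot go negative.
    return max(0.0, base_score - penalty_per_call * excess)
```

Under these assumed defaults, a run that scores 80 with 13 tool calls would drop to 74, while the same run with 8 calls would keep its full score.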

Community Sentiment

Mixed

Positives

This benchmark is timely and addresses the shortcomings of existing benchmarks, highlighting the complexity of Magic's mechanics and edge cases.
Testing LLMs with maximum thinking time and web search capabilities showed they could follow rules better than average players, suggesting potential for advanced AI in gaming.
The exploration of letting LLMs play Magic like humans could lead to more engaging gameplay experiences, emphasizing natural interaction over rigid rule enforcement.

Concerns

LLM grading may be unreliable; a rules engine might yield more consistent results than relying on LLMs alone.
The scoring method based on a simple prompt may not adequately capture the quality of the simulation, indicating potential biases in evaluation.
