Not a new standard, but still exciting: Mario as a benchmark for AIs.
Benchmarks for AI models are often dry: mathematics, logic tests, complex data analyses. But researchers at UC San Diego have taken a new approach – they simply let their AIs play Super Mario Bros., as TechSpot reports.
Sounds like a curious experiment? Perhaps. But it shows that timing can matter more than raw computing power.
The experiment: GamingAgent as an AI controller
The researchers at the Hao AI Lab at the University of California San Diego developed the GamingAgent framework (GitHub), which lets AI models control the plumber Mario via Python code.
An emulated version of Super Mario Bros. on the NES served as the basis. The AIs were given simple instructions such as "Jump over this enemy" along with screenshots for orientation.
The goal was to find out how well the models could plan their actions and adapt them in real time.
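In rough outline, such an agent loop could look like the following Python sketch. The helper functions are hypothetical stand-ins, not GamingAgent's actual API – they only illustrate the see-frame, ask-model, press-buttons cycle the article describes.

```python
import time
from dataclasses import dataclass

@dataclass
class Action:
    buttons: list[str]   # e.g. ["A"] to jump, ["RIGHT", "A"] for a running jump
    hold_frames: int     # how many frames to hold the buttons

def grab_screenshot() -> bytes:
    """Hypothetical stand-in: would capture the current frame from the NES emulator."""
    return b""

def ask_model(prompt: str, image: bytes) -> Action:
    """Hypothetical stand-in: would send the instruction plus screenshot
    to the language model and parse its reply into controller input."""
    return Action(buttons=["RIGHT", "A"], hold_frames=20)

def press_buttons(action: Action) -> None:
    """Hypothetical stand-in: would forward the button presses to the emulator."""
    print(f"pressing {action.buttons} for {action.hold_frames} frames")

INSTRUCTION = "Jump over this enemy."

# The core loop: see frame, ask model, act -- while the game keeps running.
for _ in range(3):  # a real agent would loop until the level ends
    frame = grab_screenshot()
    action = ask_model(INSTRUCTION, frame)
    press_buttons(action)
    time.sleep(0.05)
```

The crucial detail is that the game does not pause while the model deliberates – a point that becomes important below.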
"Claude-3.7 was tested on Pokémon Red, but what about more real-time games like Super Mario? We threw AI gaming agents into LIVE Super Mario games and found Claude-3.7 outperformed other models with simple heuristics. Claude-3.5 is also strong, but less capable of… pic.twitter.com/bqZVblwqX3"
— Hao AI Lab (@haoailab) February 28, 2025
Claude 3.7 dominates – GPT-4o stumbles
The results might surprise you: Anthropic's Claude 3.7 performed best. It mastered precise jumps, skillfully avoided enemies, and acted confidently overall.
Even its predecessor Claude 3.5 did well, if not quite as impressively.
Things looked quite different for OpenAI's GPT-4o and Google's Gemini 1.5 Pro. Both models, otherwise known for their strong reasoning skills, struggled.
They often failed at basic game mechanics, jumping into pits uncontrolled or running straight into enemies.
Timing beats logic
The test showed that quick reflexes are more important than complex logic – at least when it comes to playing Mario.
While some AI models tried to think situations through, that deliberation led to long delays.
After all, just a few milliseconds in Super Mario Bros. can make the difference between a successful jump and a failed attempt.
The researchers suspect that "thinking" models like GPT-4o deliberate for too long before acting – and therefore jump into the void.
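A quick back-of-the-envelope calculation shows why latency is so punishing. The NES outputs roughly 60 frames per second, so each frame lasts about 16.7 milliseconds; the latency figures in the sketch below are assumptions for illustration, not measured values from the study.

```python
# Back-of-the-envelope: how stale is a model's decision by the time it arrives?
# The latency figures below are assumptions for illustration, not measurements.

NES_FPS = 60                  # the NES renders roughly 60 frames per second
FRAME_MS = 1000 / NES_FPS     # about 16.7 ms per frame

assumed_latency_ms = {
    "fast, reflex-style model": 300,
    "slow, deliberating model": 3000,
}

for model, latency in assumed_latency_ms.items():
    stale_frames = latency / FRAME_MS
    print(f"{model}: answer arrives ~{stale_frames:.0f} frames late")
    # A jump planned from a seconds-old screenshot targets a game world
    # that has long since moved on -- hence the "jump into the void".
```

Under these assumptions, a model that ponders for three seconds is reacting to a screenshot that is around 180 frames out of date.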
Retro games as an AI benchmark?
Of course, the question remains how meaningful such tests are. An AI model that can steer Mario safely through a level is not automatically suited to complex tasks in the real world. Nevertheless, the experiment provides an exciting insight: raw computing power is not the only thing that matters – fast, intuitive decisions count too.