Probably the greatest video game classic of all time is almost 40 years old, and it is now helping modern research: Super Mario meets AI


Not a new standard, but still exciting: Mario as a benchmark for AIs.

Benchmarks for AI models are often dry: mathematics, logic tests, complex data analyses. But researchers at UC San Diego have taken a new approach – and simply let their AIs play Super Mario Bros., as TechSpot reports.

Sounds like a curious experiment? Perhaps. But it certainly shows that timing is sometimes more important than pure computing power.

The experiment: GamingAgent as an AI controller

The researchers at the Hao AI Lab at the University of California San Diego developed the GamingAgent framework (GitHub), which lets AI models control the plumber Mario via Python code.

An emulated version of Super Mario Bros. on the NES served as the basis. The AIs were given simple instructions such as "Jump over this enemy," along with screenshots for orientation.

The goal was to find out how well the models could plan their actions and adapt them in real time.
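Conceptually, this amounts to an observe-decide-act loop: grab a frame, ask the model what to do, feed the chosen inputs back into the emulator. The sketch below is a minimal illustration of that idea, not GamingAgent's actual API; the emulator and model-client objects and all function names are hypothetical placeholders.

```python
import time

def capture_screenshot(emulator):
    """Grab the current NES frame as an image (hypothetical emulator API)."""
    return emulator.get_frame()

def query_model(client, screenshot, instruction):
    """Send the frame plus a short instruction to the model and get back
    a small action plan, e.g. a list of button presses."""
    response = client.complete(prompt=instruction, image=screenshot)
    return response.actions  # e.g. ["right", "right", "jump"]

def press_buttons(emulator, actions):
    """Feed the model's chosen inputs back into the emulator."""
    for button in actions:
        emulator.press(button)

def agent_loop(emulator, client, instruction="Jump over this enemy"):
    """Observe -> decide -> act, repeated while the game is running."""
    while emulator.running():
        frame = capture_screenshot(emulator)
        actions = query_model(client, frame, instruction)
        press_buttons(emulator, actions)
        time.sleep(0.05)  # crude pacing; in practice, model latency dominates
```

However such a loop is paced, the model's response time ultimately sets how quickly Mario can react – which is exactly where the results below come in.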

Claude 3.7 dominates – GPT-4o stumbles

The results might surprise you: Anthropic’s Claude 3.7 performed best. It mastered precise jumps, skillfully avoided opponents and acted confidently overall.

Even its predecessor Claude 3.5 did well, if not quite as impressively.

Things looked quite different for GPT-4o from OpenAI and Google's Gemini 1.5 Pro. Although both models are known for their strong logical reasoning, they struggled.

They often failed at basic game mechanics, jumping uncontrolled into pits or running straight into enemies.

Timing beats logic

The test showed that quick reflexes are more important than complex logic – at least when it comes to playing Mario.

While some AI models try to reason through each situation before acting, that approach led to long delays.

After all, just a few milliseconds in Super Mario Bros. can make the difference between a successful jump and a failed attempt.

The researchers suspect that models like GPT-4o spend too long calculating before they act and therefore jump into the void.
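To put the timing problem in numbers: the NES runs at roughly 60 frames per second, so a single frame lasts about 16.7 milliseconds. The short sketch below shows how many frames of game time pass while a model is still deliberating; the latency figures are purely illustrative, not measurements from the study.

```python
FPS = 60                  # the NES runs at roughly 60 frames per second
FRAME_MS = 1000 / FPS     # ~16.7 ms per frame

def frames_elapsed(model_latency_ms: float) -> int:
    """How many game frames pass while the model is still 'thinking'."""
    return int(model_latency_ms // FRAME_MS)

# Illustrative latencies only, not measured values from the experiment:
for latency_ms in (50, 500, 2000):
    print(f"{latency_ms} ms of model latency ~ {frames_elapsed(latency_ms)} frames of game time")
```

A model that needs half a second per decision is, in game terms, reacting roughly 30 frames late – far too slow for a precise jump.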

Retro games as an AI benchmark?

Of course, the question remains how meaningful such tests are. An AI model that can clear a Super Mario Bros. level is not automatically suited to complex real-world tasks. Nevertheless, the experiment offers an exciting insight: raw computing power is not the only thing that matters; fast, intuitive decisions count too.