“Available multimodal models are still unable to perform tasks that require interactive planning and orientation in a dynamic environment. Researchers at Princeton University reached this conclusion using VideoGameBench. Gemini 2.5 Pro plays Kirby’s Dream Land in real time. Source: VideoGameBench. The researchers tested Gemini 2.5 Pro, GPT-4o, Llama 4, Gemini 2.0 Flash and Claude […]”, writes Businessua.com.ua

Available multimodal models are still unable to perform tasks that require interactive planning and orientation in a dynamic environment. Researchers at Princeton University reached this conclusion using the VideoGameBench benchmark.
Gemini 2.5 Pro plays Kirby’s Dream Land in real time. Source: VideoGameBench.
The researchers tested Gemini 2.5 Pro, GPT-4o, Llama 4, Gemini 2.0 Flash and Claude 3.7 Sonnet on 10 popular 2D games of the late 1990s, from Super Mario to Age of Empires. The conditions: each model had access only to the game's video output and a brief description of the controls and goals.
VideoGameBench test setup. Source: arxiv.org.
The best real-time result, achieved by Gemini 2.5 Pro, is a success rate of only 0.48%. In the simplified Lite mode, where the game pauses before each action, the result is slightly higher: 1.6%.
Performance on the VideoGameBench test split, which consists of 10 games. Each score is the percentage of the game completed, measured by checkpoints passed; 0% means the agent did not reach the first checkpoint. The overall score is the arithmetic mean across all games. Source: arxiv.org.
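The scoring described in the caption can be illustrated with a minimal sketch. The function names and checkpoint counts below are hypothetical and are not taken from the benchmark's code; the sketch also assumes every checkpoint carries equal weight, which the caption does not state.

```python
def game_score(checkpoints_passed: int, total_checkpoints: int) -> float:
    """Percentage of one game completed, based on checkpoints passed."""
    if total_checkpoints <= 0:
        raise ValueError("total_checkpoints must be positive")
    return 100.0 * checkpoints_passed / total_checkpoints


def overall_score(per_game_scores: list[float]) -> float:
    """Arithmetic mean of the per-game percentages."""
    if not per_game_scores:
        raise ValueError("need at least one game score")
    return sum(per_game_scores) / len(per_game_scores)


# Hypothetical run: the agent passes 1 of 20 checkpoints in one game
# and none in the other nine games.
scores = [game_score(1, 20)] + [game_score(0, 20)] * 9
print(overall_score(scores))  # prints 0.5
```

Because the overall figure is a plain average, a model that makes a little progress in a single game and none elsewhere still ends up with a score well below 1%.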
Unlike text tasks, games require not only image recognition but also fast decisions, spatial memory, long-term planning and adaptation to changing conditions. Inference latency, even in the most modern VLMs, prevents them from acting in real time, especially in arcade or strategy titles.
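A rough sketch of why latency matters: while a model is still computing its next action, a real-time game keeps advancing, whereas in Lite mode it is paused. The function name, the 60 frames per second and the 3-second latency below are illustrative assumptions, not figures from the study.

```python
def frames_missed(latency_s: float, fps: float, paused_during_inference: bool) -> int:
    """Frames the game advances while the model is deciding on one action."""
    if paused_during_inference:
        # Lite mode: the game waits for the model, so nothing is missed.
        return 0
    # Real-time mode: the game keeps running for the whole inference time.
    return int(latency_s * fps)


# Hypothetical numbers: 3 seconds per decision, game running at 60 frames per second.
real_time = frames_missed(3.0, 60.0, paused_during_inference=False)
lite = frames_missed(3.0, 60.0, paused_during_inference=True)
print(real_time, lite)  # prints 180 0
```

Under these assumed numbers the model's action would be applied to a game state 180 frames newer than the one it saw, which is one way to read the gap between the real-time and Lite results.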
“Models cannot understand a simple instruction such as ‘Turn on the mill’, even with hints on the screen,” the authors of the study said.
According to them, even the basic logic of the game world (such as the fact that water is needed for food) proved too complex for modern VLMs.
The code and example playthroughs are available on the official VideoGameBench website and on GitHub.
As a reminder, specialists at Palisade Research recorded attempts at “self-preservation” in several AI models.