Google DeepMind Launches SIMA 2, An AI Agent That Thinks, Plans, and Learns Across Virtual Worlds

"It's a more general agent. It can complete complex tasks in previously unseen environments. And it's a self-improving agent."​

For years, AI agents have been remarkably brittle. Train them on a specific task in a specific environment, and they excel. Move them to a new context, and they fail. This is exactly why the robotics industry has remained largely dependent on human engineering and specialized task-specific training. 

But Google DeepMind just launched something fundamentally different. SIMA 2, the second generation of its Scalable Instructable Multiworld Agent, marks a shift from rigid instruction-followers to general-purpose reasoners. And the implications extend far beyond gaming.

The original SIMA, released in March 2024, was impressive for its time. It could follow over 600 language-based instructions across diverse virtual environments, everything from “turn left” to “open the map.” But it had a critical limitation. It mapped pixels directly to actions. No reasoning, planning, or understanding of context.

SIMA 2 changes everything by integrating Gemini 2.5 Flash Lite as its core reasoning engine. This isn’t a minor tweak. It’s a fundamental architectural shift. As DeepMind senior research scientist Joe Marino explained, “SIMA 2 is a step change and improvement in capabilities over SIMA 1. It’s a more general agent. It can complete complex tasks in previously unseen environments. And it’s a self-improving agent.”

Now when you ask SIMA 2 to “build a shelter” or “find the red house,” the agent doesn’t immediately fumble around. Instead, it thinks. It breaks the goal into component steps. It can even explain its reasoning, answer questions about its objectives, and justify its decisions. This interpretability matters enormously for debugging and trust.
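To make that concrete, here is a minimal sketch of what goal decomposition could look like in code. Everything below, the Plan type, the function name, the hard-coded steps, is a hypothetical stand-in; DeepMind hasn’t published SIMA 2’s internals. The point is the shape: a high-level goal goes in, ordered steps plus an explainable rationale come out.

```python
# Minimal sketch of "think before acting": a goal becomes ordered steps
# plus a rationale the agent can surface when asked to justify itself.
# All names and the hard-coded plan are illustrative, not DeepMind's API.
from dataclasses import dataclass

@dataclass
class Plan:
    goal: str
    steps: list[str]
    rationale: str  # what the agent would report when asked "why?"

def decompose(goal: str) -> Plan:
    """Stand-in for the Gemini reasoning call that breaks a goal apart."""
    if goal == "build a shelter":
        return Plan(
            goal=goal,
            steps=["gather wood", "craft planks", "raise four walls", "add a roof"],
            rationale="A shelter needs walls and a roof; wood is the cheapest material.",
        )
    return Plan(goal=goal, steps=[goal], rationale="Goal is already atomic.")

for i, step in enumerate(decompose("build a shelter").steps, 1):
    print(f"{i}. {step}")
```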

Performance 

The numbers tell the story. SIMA 1 achieved roughly 31% task completion on complex challenges. SIMA 2 reaches 62%, nearly doubling performance and approaching human-level capability (around 70%). But the really impressive part isn’t the benchmark improvement. It’s what happens in environments SIMA 2 has never seen.

When tested in completely novel games like ASKA (a Viking survival game) and MineDojo (a research implementation of Minecraft), SIMA 2 achieved 45-75% task completion, compared to just 15-30% for SIMA 1. This isn’t memorization. This is genuine zero-shot generalization, the ability to transfer learned concepts to entirely new contexts.

What makes this possible? The agent learns abstract concepts, not just pixel patterns. When SIMA 2 understands “mining” in one game, it can apply that knowledge to “harvesting” in another. 

When asked to find a house “the color of a ripe tomato,” Gemini’s reasoning engine deduces that ripe tomatoes are red, then identifies and navigates to the correct target. This is human-like problem-solving, abstract association meeting environmental observation.

SIMA 2 accepts instructions in almost any form. Text. Speech. Sketches. Emojis. Multiple languages. This isn’t a gimmick; it matters for real-world deployment, where commands won’t always come in perfectly formatted text.

The system grounds these diverse input modalities in a shared representation, linking text, audio, images, and in-game actions into one coherent framework. A user could sketch a path on the screen, describe it in Spanish, and punctuate the instruction with emojis, and SIMA 2 would understand each equally well.
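In code, “one coherent framework” amounts to mapping every modality into the same vector space before the policy ever sees it. The toy encoder below is a deliberately trivial stand-in (SIMA 2’s actual encoders are unpublished), but it shows the contract: any modality in, one fixed-size embedding out.

```python
# Toy sketch of grounding mixed-modality instructions in a shared space.
# The encoding is a trivial stand-in; the real point is that downstream
# code sees one vector type regardless of the input modality.
from dataclasses import dataclass

DIM = 4  # shared embedding dimensionality (arbitrary for this sketch)

@dataclass
class Instruction:
    modality: str   # "text", "emoji", "speech", "sketch", ...
    payload: str    # raw content (audio/stroke data omitted in this toy)

def embed(instr: Instruction) -> list[float]:
    """Map any instruction into the same fixed-size vector space."""
    codes = [float(ord(c)) for c in instr.payload[:DIM]]
    return (codes + [0.0] * DIM)[:DIM]  # pad/truncate to DIM

# Spanish text and an emoji both become plain DIM-length vectors, so the
# policy that consumes them never branches on modality.
print(embed(Instruction("text", "ve a la casa roja")))
print(embed(Instruction("emoji", "🏠")))
```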

The Self-Improvement Loop

Perhaps the most consequential innovation is SIMA 2’s built-in self-improvement mechanism. After an initial training phase using human gameplay as a baseline, DeepMind moves the agent into new games and lets it learn exclusively from its own experience. A separate Gemini model generates tasks for the agent. A learned reward model scores each attempt.

These self-generated trajectories are stored and fed back into training for the next generation of SIMA 2. Later versions use this data to succeed on tasks where earlier generations failed, all without any fresh human demonstrations.
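Condensed into Python, the loop looks roughly like this. Every interface here (propose_task, run_episode, reward, the threshold) is an assumed stand-in, not DeepMind’s published API; it simply mirrors the steps described above: a task-setter model proposes goals, the agent attempts them, a learned reward model scores the attempts, and good trajectories are banked for the next generation.

```python
# Condensed sketch of the self-improvement loop. All interfaces below
# are assumptions for illustration, not DeepMind's actual code.
import random

def propose_task(game: str) -> str:
    """Stand-in for the separate Gemini model that invents new tasks."""
    return random.choice([f"chop a tree in {game}", f"find water in {game}"])

def run_episode(agent, task: str) -> list[tuple[str, str]]:
    """Agent attempts the task; returns an (observation, action) trajectory."""
    return [(f"obs_{t}", agent(task, t)) for t in range(3)]

def reward(trajectory) -> float:
    """Stand-in for the learned reward model judging the attempt."""
    return random.random()

def self_improve(agent, game: str, rounds: int = 100, keep_above: float = 0.5):
    replay = []  # self-generated data replaces fresh human demonstrations
    for _ in range(rounds):
        task = propose_task(game)
        traj = run_episode(agent, task)
        if reward(traj) >= keep_above:
            replay.append((task, traj))  # banked for next-generation training
    return replay

bank = self_improve(lambda task, t: f"action_{t}", "ASKA")
print(f"{len(bank)} trajectories banked for the next SIMA generation")
```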

This is model-in-the-loop learning at scale. It dramatically reduces dependence on expensive human-labeled data while enabling continuous improvement. It’s the kind of self-directed learning that will eventually power general-purpose robots.

DeepMind has been explicit: SIMA 2 isn’t meant as a gaming assistant. It’s a proving ground for embodied AI that will eventually control real robots.

Embodied agents, systems that interact with physical or virtual worlds through sensory input and motor output, are fundamentally different from code-executing assistants or calendar-managing bots. They must perceive, reason, and act in real time.

SIMA 2 does this in virtual worlds using keyboard and mouse controls. The skills transfer directly to robotics: navigation, tool use, collaborative task execution, obstacle avoidance, multi-step planning.
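The control interface is worth spelling out: the agent gets only what a human player gets, screen pixels in, key and mouse events out. A stripped-down version of that perceive-reason-act loop, with a toy environment and policy standing in for the real (unpublished) ones, might look like:

```python
# Toy perceive-reason-act loop: pixels and a goal in, one human-style
# input event out, repeated until the task resolves. The environment and
# policy are stand-ins; the real agent's interface is not public.
from dataclasses import dataclass

@dataclass
class Observation:
    frame: bytes       # the rendered screen, exactly what a human sees
    instruction: str   # the current language goal

def choose_action(obs: Observation) -> str:
    """Stand-in for the Gemini-driven policy mapping pixels + goal to input."""
    return "key_w" if "find" in obs.instruction else "mouse_click"

class ToyGame:
    """Fake environment so the sketch runs end to end."""
    def __init__(self):
        self.t = 0
    def render(self) -> bytes:
        return b"\x00"          # placeholder frame
    def step(self, action: str) -> bool:
        self.t += 1
        return self.t >= 3      # pretend the task resolves after 3 steps

def run(env: ToyGame, instruction: str, max_steps: int = 100) -> None:
    for _ in range(max_steps):
        obs = Observation(frame=env.render(), instruction=instruction)
        if env.step(choose_action(obs)):  # keyboard/mouse event into the game
            break

run(ToyGame(), "find the red house")
```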

When combined with DeepMind’s Genie 3 (a world model that generates interactive 3D environments from text or images), SIMA 2 demonstrates unprecedented adaptability. It can navigate previously unseen worlds, identify objects, and execute coherent goal-directed actions in environments it’s never encountered. This is the precursor to robots operating in unstructured physical spaces.

SIMA 2 represents a meaningful system milestone, not just a benchmark victory. By embedding Gemini at its core, DeepMind has demonstrated a practical recipe for general-purpose embodied agents: multimodal perception, language-based planning, and self-improvement loops, validated in both commercial games and AI-generated environments.

The limitations remain real: long-term memory, extremely complex multi-step reasoning, and precise low-level control still pose challenges. But the trajectory is unmistakable. The gap between virtual agents and physical robots just got dramatically smaller.
