Steering Agents is an Unsolved Problem, says OpenAI’s Alexander Embiricos

Codex proposes code changes at scale, but engineering teams still decide what gets merged

When OpenAI introduced Codex in August 2021, it was framed as a model that translates natural language into code. Today's Codex has evolved: product lead Alexander Embiricos tells the a16z podcast that the product is meant to be "an agent working remotely," a cloud teammate that performs tasks in sandboxed environments and proposes pull requests for review.

Embiricos describes a development arc from small, single-file autocompletions to multi-file reasoning and background execution. "You can give Codex a description, and it will generate a pull request on its own," he said, noting the product's cloud-agent design and the safety tradeoffs that informed it. OpenAI's documentation backs this up: each Codex task runs in its own repo-scoped cloud sandbox.

What developers actually use Codex for

Embiricos says the dominant use case so far is feature development. "By far the thing that people use Codex for is building new features," he told a16z, adding that users appreciate the speed to first prototype. That aligns with independent reporting: Engine Labs called Codex "functional, clean, and narrowly focused," saying it accelerates prototyping while leaving rough edges for teams using non-GitHub tooling.

Concrete usage numbers help explain developer behavior. Embiricos noted that "Codex has opened like 400k PRs since launch in like 34 days … and it's merged like 350-something K of those PRs," a claimed merge rate of roughly 87%, which he links to Codex's cloud-agent form factor and its pattern of showing work before opening PRs. That form factor intentionally trades automatic pushes for reviewable drafts to reduce operational risk.
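The merge rate implied by those quoted figures is easy to check. A minimal sketch, using the approximate numbers as stated in the interview (both are rounded quotes, not official telemetry):

```python
# Back-of-envelope check of the merge rate implied by the quoted figures.
# "400k PRs opened" and "350-something K merged" are approximations from
# the interview, so 350k is treated as a lower bound.
prs_opened = 400_000
prs_merged = 350_000

merge_rate = prs_merged / prs_opened * 100
print(f"Implied merge rate: {merge_rate:.1f}%")  # ~87.5%
```

Since "350-something K" could be anywhere up to ~360k, the true figure sits somewhere in the high 80s.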

Third-party experimental evidence supports productivity gains for similar tools: a controlled trial of GitHub Copilot found treated developers completed a web-server task roughly 55% faster than controls. At the same time, security studies have flagged real downsides: one user study found participants with access to AI code assistants produced significantly less secure code on certain security tasks.

How Codex stacks up in a crowded field

Embiricos acknowledges there are "lots of good AI systems out there" and that the market is competitive. Competitors range from Google and Anthropic to startups like Cursor and Tabnine; recent entrants include xAI's agentic coding model, Grok-Code-Fast-1. Broader review coverage suggests Codex is strong for experienced developers and enterprise workflows but lacks the "bells and whistles" some vibe-coding users expect.

What may matter most is integration and workflow. Embiricos and OpenAI emphasize that Codex’s value comes from fitting into dev environments, preserving project context across cloud and local sessions, and making reviewable PRs rather than blind pushes: features that affect merge rates and operational trust. 

Governance, trust, and practical adoption

Embiricos is explicit about limits: delegating work to agents creates new safety and product problems (steering, test coverage, and how humans specify tradeoffs). “An unsolved problem … is how do we steer agents that are working independently,” he said, noting that humans often don’t know the exact solution they want and that agent proposals should be curated, not blindly trusted. 

Policy and security guidance is catching up. OpenAI's product page explains sandboxing and approval modes. Independent reviews stress guardrails: productivity gains are real, but firms must adopt CI, threat modeling, automated testing, and legal review to avoid insecure or legally ambiguous outputs. For regulated or high-risk systems, the decision to adopt Codex-like agents should be driven by assurance (reproducibility, provenance of training data, and formal security checks), not just speed.

Codex's jump from autocomplete to a stateful cloud agent is real and consequential. The a16z interview with Embiricos makes the case that the tool speeds prototyping and can take on repeatable engineering tasks, but only when organizations pair it with rigorous review, testing, and governance. The business decision now is simple, if uncomfortable: adopt the speed, but don't surrender the controls.


Mukundan Sivaraj
Mukundan covers the AI startup ecosystem for AIM Media House. Reach out to him at mukundan.sivaraj@aimmediahouse.com.
November 14, 2025
