“We can outperform OpenAI’s deep research and every leading model’s deep research quality,” former Twitter CEO Parag Agrawal told TBPN as he announced Parallel’s new Deep Research API. Parallel aims to beat the giant general-purpose models at a capability they are actively trying to perfect themselves. For now, however, the claim is premature.
His company’s launch just attracted $30 million from backers including Khosla Ventures, Index Ventures, and First Round Capital. The appeal is obvious: deep research has quickly become one of the most valuable applications of AI, pushing the technology toward the kind of multi-step reasoning that businesses, scientists, and investors actually rely on.
The argument makes sense on the surface. Agrawal explained the idea: “We’re building Parallel to build infrastructure for AIs using the web. The web was built for humans. Two years ago, I started thinking AI… is going to be the primary user at a massive scale.”
By designing search, indexing, and ranking for machines instead of browsers, Parallel can feed more reliable factual signals into language models. That is an engineering win the company can point to legitimately.
The launch materials and Agrawal’s posts do make concrete claims backed by numbers. The company says its Deep Research API outscored humans and leading models on hard web-research tests, and Agrawal has repeated those figures on LinkedIn and in press coverage. The company’s use of OpenAI’s BrowseComp benchmark, which measures how well agents find hard-to-locate web information, is relevant here. BrowseComp is designed specifically to test the multi-step, persistent browsing behavior that ordinary chatbots do not always handle well.
Parallel is also not the only company trying to brand itself around deep research. In recent weeks, Thomson Reuters rolled out its own Deep Research tool aimed at transforming legal search and case analysis, while Manus announced a “Wide Research” product that deploys swarms of agents (hundreds at once) to crawl and synthesize web sources.
The problem is the breadth of the claim. Agrawal did not say “in our tests” or “on selected problems.” His claim was absolute. And that is the claim that needs third-party verification.
“Parallel exceeds the accuracy of humans working for 2 hours for just $0.10 per task.”
The company’s numbers look impressive, but they are self-reported and derived from a narrow set of tests and conditions. They point to specific benchmark runs and to demos showing better accuracy and citation behavior than other systems in those runs. When a startup claims superiority on a hard, newly defined task, the responsible next steps are to publish precise datasets, make evaluation code public, and invite neutral researchers or industry labs to reproduce the runs.
Third-party vetting isn’t unheard of in this space. Cognition’s “Devin” had its coding abilities checked on SWE-bench Verified, where patches are independently executed and scored by the benchmark maintainers; it reported a 13.86% solve rate, a result reflected on the public leaderboard. In healthcare, Abridge’s ambient AI scribe has a peer-reviewed evaluation from the University of Kansas Medical Center published this year, reporting significant improvements in documentation workflow and reduced after-hours work.
There is also another practical reason for caution. The big model makers have been building their own deep research modes, and they also have an abundance of resources. Google’s Gemini now advertises a Deep Research capability that is explicitly designed to explore live web content and synthesize it into reports. That is the same area Parallel says it excels in. Google has deep control over search, and the massive compute and context windows that matter in these tasks. OpenAI has its own browsing benchmarks and specialized research agents too.
While OpenAI’s latest model has been the subject of mixed user reaction even as it posts strong internal numbers on many benchmarks, that messiness does not settle Parallel’s claim either way.
Parallel can’t be dismissed outright. Building a data retrieval system optimized for agents is an underexplored design point, and Agrawal’s statements suggest the company has healthy operating instincts: “At every price point, you have got to be the best. And for someone who has no price sensitivity, you have got to be the absolute best.”
The rest of the field — independent benchmarkers, researchers, and customers — will need to publicly confirm that Parallel’s Deep Research API consistently beats the likes of OpenAI and Google. Until then, Parallel is measured only against its own internal metrics and the public numbers from rivals.