Is Positron's big-model bet losing to small AI?

As enterprises embrace small language models, Positron’s chips may face a shrinking future
Positron’s message to the AI world is this: the language models of the future will be massive, and they’ll need inference chips like Atlas to run.
The startup, founded in 2023, makes an inference accelerator it says blows past Nvidia’s GPUs in both energy and cost efficiency: “2x to 5x performance per watt and dollar,” according to CEO Mitesh Agrawal. Atlas is air-cooled, runs Nvidia-trained models without any code rewrites, and was shipped into production just 15 months after launch. Early customers include Cloudflare and Parasail.
That’s fast execution. Investors seem impressed: Positron just closed a $51.6 million Series A led by Valor Equity, Atreides, and DFJ. Next on the roadmap is Titan, a 2026 system built around Positron’s custom “Asimov” ASICs. The claim? That a single 8-card Titan server will eventually be able to serve a 16-trillion-parameter model. That’s five times the size of GPT-4, and potentially more than any model yet publicly confirmed.
But Positron’s bet, that the frontier will keep expanding and that customers will need air-cooled inference hardware for multi-trillion-parameter models, runs counter to an accelerating trend: AI is getting smaller.
Just a year ago, there was broad consensus that bigger was better. But as the market matures, a lot of companies are finding that small is, in fact, good enough.
“For a lot of tasks, an 8‑billion-parameter model is actually pretty good,” said Carnegie Mellon’s Zico Kolter in an interview with Wired. “And it can run on a laptop or cell phone instead of a huge data center.”
This idea is no longer fringe. Meta’s LLaMA 3 8B, Microsoft’s Phi-3, and Google’s Gemini Nano are all sub-10B models built for local, low-cost inference. Red Hat published that models like Mistral-7B and LLaMA 3 “deliver comparable accuracy to LLMs with lower costs, enhanced data privacy, and simplified deployment.” Enterprises are taking notice.
Why? Because small models are cheaper to run, easier to deploy, and often “good enough” for real-world jobs like summarization and document classification. They require less hardware, reduce latency, and avoid GPU bottlenecks.
Some companies now fine-tune open-source small models for niche tasks and deploy them on commodity cloud instances or even edge devices. A few are even running 7B models on phones. Open-source tooling has caught up fast, and techniques like quantization and distillation mean these models can be shrunk further while preserving quality.
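To make the shrinking concrete: quantization maps full-precision weights onto low-bit integers with a shared scale factor. The sketch below shows minimal symmetric int8 quantization in plain Python; it is illustrative only (real toolchains do per-channel scaling, calibration, and much more), and the numbers are made up.

```python
# Minimal sketch of symmetric int8 quantization: map float weights
# into [-127, 127] with a single per-tensor scale, then reconstruct.
# Illustrative only; production toolchains are far more involved.

def quantize_int8(weights):
    # One scale for the whole tensor, chosen so the largest weight maps to 127.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.89]     # hypothetical weights
q, scale = quantize_int8(weights)       # 4 ints + 1 float instead of 4 floats
restored = dequantize(q, scale)
# Each restored value is within one quantization step of the original,
# while storage drops from 32 bits per weight to 8.
```

The trade is precision for footprint: an 8B-parameter model at int8 fits in roughly a quarter of the memory of its float32 original, which is what puts 7B-class models within reach of laptops and phones.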
All of this points to a future where many workloads, especially customer-facing or mobile ones, lean on small, efficient models rather than trillion-parameter giants.
So what does that mean for Positron?
The company seems aware of the split. CEO Agrawal has acknowledged the “duality” in the market: small models on devices, big ones in the cloud. In Positron’s view, we’ll eventually need both, with frontier models providing deep reasoning and on-device models offering fast, lightweight tasks.
But that vision depends on where the money flows. And increasingly, it's flowing toward use cases where big-model inference is either too slow, too expensive, or simply overkill.
Positron's Atlas chip excels in performance per watt. In internal benchmarks on LLaMA 3 8B, an Atlas rack reportedly delivers 280 tokens/sec at 2,000 watts, versus 180 tokens/sec at nearly 6,000 watts for Nvidia’s top-tier H200 DGX setup: over 50% more throughput at a third of the power, or more than four times the tokens per watt.
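Working through the reported figures (as quoted above, not independently verified):

```python
# Tokens/sec per watt from the reported benchmark numbers.
atlas_tps, atlas_watts = 280, 2000   # Atlas rack (reported)
h200_tps, h200_watts = 180, 6000     # Nvidia H200 DGX ("nearly 6,000 watts", reported)

atlas_eff = atlas_tps / atlas_watts  # 0.14 tokens/sec per watt
h200_eff = h200_tps / h200_watts     # 0.03 tokens/sec per watt
ratio = atlas_eff / h200_eff         # ~4.7x efficiency advantage
```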
Atlas is also easier to deploy: it works in traditional air-cooled data centers, whereas Nvidia’s new Blackwell GPUs will demand liquid cooling. That could give Positron a clear advantage in edge or legacy environments where data centers don’t have specialized infrastructure.
Still, even that might not be enough. Nvidia’s grip on the software ecosystem, particularly CUDA, means that switching chips isn’t just a hardware decision. It’s a systems decision. CUDA is entrenched in model training, deployment tools, and devops pipelines.
Startups like Groq, which also chase the inference market, have stumbled at the scaling phase. According to The Information, Groq cut its 2025 revenue forecast from $2 billion to $500 million. Even Cerebras, which builds some of the largest AI chips in the world, has pivoted its marketing to emphasize price competitiveness over raw size.
Small models keep improving, and competitors like Nvidia and AMD continue to release ever-denser, better-performing inference chips. Positron risks being stuck in the middle: too big for the fast-and-light crowd, not entrenched enough to unseat Nvidia in the hyperscale market.
To succeed, Positron needs to sell its strengths: power efficiency, ease of installation, and cost per inference. It needs repeatable wins outside the bleeding edge: think legacy industries like insurance, logistics, healthcare. It needs to own the space that entrenched competitors can’t, or won’t, reach.
The risk is that this space is shrinking, in both senses: as models get smaller, faster, and more ubiquitous, trillion-parameter inference infrastructure may prove more niche than Positron hopes.
Key Takeaways
- Positron developed Atlas inference chips, claiming superior energy and cost efficiency over Nvidia for large AI models.
- Despite rapid execution and significant funding, Positron's strategy of focusing on massive models faces a market shift toward smaller AI.
- The AI industry is increasingly favoring smaller, more efficient language models, challenging Positron's large-model chip bet.