While text and image models dominate the generative-AI conversation, voice is quickly emerging as an enterprise-scale medium.
ElevenLabs, the London-based voice-AI company founded in 2022, has begun positioning its technology as core infrastructure for customer engagement and creative production.
The implications of the company’s latest announcements are clear: voice is no longer a feature of AI systems but an asset that enterprises can own, license, and deploy.
Voice becomes enterprise infrastructure
At its first company summit this month, ElevenLabs announced enterprise partnerships with Square and MasterClass. The company said its voice-agent platform now powers Square’s AI-powered voice ordering, enabling restaurants to answer phone calls with synthetic voices that can take and customize orders during peak hours.
For MasterClass, ElevenLabs’ technology provides voice-cloned coaches inside MasterClass On Call, allowing users to interact conversationally with instructors modeled on real public figures such as Gordon Ramsay and Mark Cuban. The partnerships expand ElevenLabs’ presence from consumer creators into enterprise deployments.
At the same event, ElevenLabs introduced the Iconic Voice Marketplace, where companies can license synthetic versions of celebrity voices. Actors Michael Caine and Matthew McConaughey were announced as the first participants: Caine’s licensed voice will appear in audiobook narration through the ElevenReader app, while McConaughey is using the platform to release a Spanish-language audio edition of his Lyrics of Livin’ newsletter.
CEO Mati Staniszewski described these moves as part of a push to make “voice essential infrastructure for AI agents and experiences.” He said enterprise customers now account for roughly half of the company’s revenue, up from about 10 percent a year earlier, and cited deployments such as Fortnite’s conversational Darth Vader, created in partnership with the estate of James Earl Jones.
ElevenLabs claims its platform now serves millions of users and thousands of businesses, including employees “from over 75 percent of the Fortune 500,” and that its models can generate interactive voices in more than 30 languages. The company says it has expanded its voice library to over 10,000 voices, a scale it links to rapid revenue growth: industry trackers estimate more than $200 million in annual recurring revenue and a valuation of about $6.6 billion as of late 2025.
Staniszewski has emphasized that growth will depend on safety and provenance features. ElevenLabs built a classifier that detects AI-generated audio, moderation layers that screen text and voice content, and provenance tracing that links every clip to its source account. These measures are meant to reassure enterprise clients wary of deep-fake or impersonation risks.
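The safety stack described above can be pictured as a release gate: a moderation check, a synthetic-audio classifier, and a provenance tag linking each clip to its source account. The following is a minimal sketch of that pattern, not ElevenLabs’ actual API; every name here (`Clip`, `gate`, the classifier callables) is hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable

@dataclass
class Clip:
    audio: bytes
    account_id: str

def provenance_tag(clip: Clip) -> str:
    # Link the clip to its source account via a content hash,
    # the kind of traceability the article describes.
    digest = hashlib.sha256(clip.audio).hexdigest()[:16]
    return f"{clip.account_id}:{digest}"

def gate(clip: Clip,
         is_synthetic: Callable[[Clip], bool],
         is_allowed: Callable[[Clip], bool]) -> dict:
    # Run the (hypothetical) moderation and classifier checks,
    # then attach provenance metadata before release.
    if not is_allowed(clip):
        return {"released": False, "reason": "moderation"}
    return {
        "released": True,
        "synthetic": is_synthetic(clip),
        "provenance": provenance_tag(clip),
    }
```

The point of the design is that moderation runs before release and the provenance tag travels with every approved clip, so a misused voice can be traced back to the account that generated it.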
The race to own the voice stack
The timing of ElevenLabs’ expansion coincides with OpenAI’s entry into the same domain. In March 2025, OpenAI released next-generation audio models for developers, integrating speech-to-text and text-to-speech directly into its API to enable “real-time conversational agents”. The company’s earlier Voice Engine preview showed that a natural-sounding clone could be produced from a 15-second sample, raising both commercial and ethical questions.
OpenAI’s advantage is multimodality: voice joins text, image, and video in a unified model ecosystem. For enterprises, that breadth simplifies integration but limits control. ElevenLabs, by contrast, offers narrower scope but deeper specialization: expressive quality, multilingual fidelity, and the option to license or own voices as IP assets.
Specialist competitors are also pressing in. PolyAI, based in London, builds conversational voice agents for call-center deployments and claims to handle “millions of real customer calls each day” for clients like FedEx and BMW. Resemble AI markets on-premises voice cloning that allows enterprises to keep data within their own infrastructure, appealing to industries with strict compliance needs. Synthflow AI, which raised $20 million in mid-2025, offers real-time voice-agent deployment with sub-400 millisecond latency for call-automation use cases.
Major cloud platforms are extending their own speech stacks: Microsoft’s MAI-Voice integrates with Copilot applications for enterprise clients; Google Cloud’s Text-to-Speech v2 leverages WaveNet to produce multilingual conversational agents; and Amazon’s Polly continues to expand within AWS Bedrock for generative-AI orchestration.
For enterprises choosing among these vendors, the trade-offs are clear. Platform giants deliver scale, compliance, and integration across modalities. Niche players like ElevenLabs and Resemble offer finer-grained control over model weights, licensing rights, and data residency.
What the shift means
The voice-AI market illustrates a broader trend in enterprise technology: companies are moving from renting general-purpose models to owning customized models trained on their own IP, voices, and brand assets. ElevenLabs’ licensing marketplace shows how a model’s output (the voice itself) can become proprietary. That notion parallels similar movements in image and language modeling, where enterprises are seeking control for compliance, consistency, and differentiation.
Operationally, deploying voice at enterprise scale introduces new demands: latency under one second for natural conversation, reliable telephony integration, multilingual performance, and human fallback for failed interactions. Governance adds another layer: tracking consent, licensing rights, and synthetic-voice provenance to prevent misuse. ElevenLabs and OpenAI both now publicize safety frameworks that address these issues, signaling that trust will be as important as technical performance.
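The operational demands above (a sub-second latency budget and human fallback for failed interactions) amount to a simple control loop. Here is an illustrative sketch under those assumptions; `handle_turn` and its callables are hypothetical, not any vendor’s API.

```python
import time

LATENCY_BUDGET_S = 1.0  # sub-second target for natural conversation
MAX_RETRIES = 1

def handle_turn(transcript, synthesize, escalate_to_human):
    # Try the voice agent within the latency budget; on error or
    # a blown budget, retry once, then fall back to a human operator.
    for _ in range(MAX_RETRIES + 1):
        start = time.monotonic()
        try:
            reply = synthesize(transcript)
        except Exception:
            continue
        if time.monotonic() - start <= LATENCY_BUDGET_S:
            return {"source": "agent", "reply": reply}
    return {"source": "human", "reply": escalate_to_human(transcript)}
```

In a real deployment the same loop would also log each fallback, since escalation rates are the operational signal that a voice agent is failing its users.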
ElevenLabs’ bet is that voice is the next model asset, and that enterprises will want to own the sound of their brands as surely as they own their logos or taglines.