Protege CEO Bobby Samuels had seen the same problem surface across AI projects: promising model ideas stalled because the right data never arrived, or arrived too late. The most valuable datasets, embedded with years of human expertise, were often locked away in private archives and entangled in technical barriers. In early 2024, he joined forces with Travis May, a veteran of data-sharing ventures like LiveRamp and Datavant, to build a business around removing those obstacles. “Access to the right training data continues to be the biggest bottleneck to AI’s progress,” Samuels said in the company’s recent funding announcement.
Protege just closed a $25 million Series A led by Footwork, with participation from CRV, Bloomberg Beta, Flex Capital, Shaper Capital, Liquid 2 Ventures, and others. The company says it grew revenue 20-fold in 2025, with tens of millions of dollars paid out to data partners through its platform. Nikhil Basu Trivedi, co-founder and general partner at Footwork, described Protege as “the connective tissue between proprietary data and cutting-edge AI,” citing its traction in healthcare, media, and “frontier AI labs.”
Structuring and Licensing Data for Hard-to-Fill AI Gaps
Protege’s platform serves as an intermediary, enabling data owners to make their proprietary datasets available for AI training under strict governance. The service handles the legal, technical, and operational steps that often delay or derail data-sharing agreements, from rights verification to structuring and annotation.
Since raising a $10 million seed round in 2024, Protege has built a catalog that includes over 300,000 hours of video, more than 500,000 hours of audio, billions of clinical notes, and hundreds of millions of medical images. The company now counts over 100 data partners, ranging from healthcare networks to media archives.
Recent product expansions have targeted specific data shortages in AI. Just this month, Protege launched an Audio & Speech vertical to address gaps in multilingual, real-world conversational audio, and a Motion Capture vertical aimed at robotics and embodied AI systems. Both initiatives focus on structuring datasets for training, for example by filtering audio for natural, multi-speaker conversations or by annotating motion capture data with actor profiles and temporal alignment.
These launches complement earlier efforts in video through SHOT (Selected Highlights Optimized for AI Training), a suite of audiovisual datasets tailored for generative AI use cases like lip-syncing, weather scene generation, and human-object interaction modeling. Across verticals, the emphasis is on curating data that is both legally clear and technically optimized for model performance.
Strategic Position in the AI Ecosystem
The company’s approach appeals to two constituencies. For data owners, it offers a path to monetize assets without ceding control, in a market where proprietary archives have often sat idle due to legal or reputational concerns. For AI developers, it provides a source of diverse, high-quality training material that could improve performance on specialized tasks, from diagnosing medical conditions to handling code-switched speech.
Competitively, Protege operates in a fragmented landscape. Some companies specialize in individual modalities, such as medical imaging or geospatial data. Others, like Snowflake and Databricks, provide infrastructure for managing datasets but leave sourcing and licensing to customers. Protege’s model of aggregating, structuring, and licensing data across industries positions it differently: more akin to a specialized exchange than a pure infrastructure provider.
As AI adoption deepens across sectors, demand for high-quality, legally cleared training data is expected to grow. Protege’s long-term challenge will be balancing expansion into new domains with maintaining the governance standards that make its platform attractive to both sides of the market.