Zendesk Thinks AI Customer Service Has a Measurement Crisis

AI customer service systems generate massive amounts of interaction data. The harder problem is deciding which signals actually matter.
Forrester found that Net Promoter Score fell in 20 out of 39 industry-country combinations in 2025, suggesting that customer expectations are outpacing service improvements. Yet 92% of contact centers have quality assurance programs in place. The problem isn't measurement. It's what happens after.
Dave Giblin, Head of Customer Success for Zendesk APAC, told AIM Media House: "Measurement alone is observability. What changes the outcome is the loop between signal and fix, running continuously."
At its Relate conference in May, Zendesk announced Quality Score, a feature that automatically measures 100% of customer conversations. The announcement arrives as enterprises struggle to evaluate AI-driven customer support at scale. Traditional quality assurance systems were designed around small samples of human conversations. AI agents changed the volume entirely.
What becomes visible when every interaction becomes measurable?
Call centers have sampled 2-5% of interactions forever. When teams were small, the math worked. You'd review a handful of calls, spot problems, coach agents, improve. The samples averaged out reasonably well.
AI agents handle thousands of conversations per day. A 2% sample of that volume leaves 98% of what's actually happening unreviewed, and a failure that affects only a handful of conversations will usually miss the sample entirely. You can't spot patterns in data you never look at.
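To put rough numbers on that, here is a minimal sketch of how likely a random review sample is to catch a rare failure. The daily volume and failure rate are illustrative assumptions, not figures from Zendesk or any company mentioned here, and the function name is hypothetical. It assumes each conversation is reviewed independently with the same probability.

```python
def chance_of_catching(failures: int, sample_rate: float) -> float:
    """Probability that at least one failing conversation is reviewed.

    Assumes each conversation is independently picked for review with
    probability `sample_rate` (a simplification of real QA sampling).
    """
    if failures < 0:
        raise ValueError("failures must be non-negative")
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    # P(miss every failing conversation) = (1 - sample_rate) ** failures
    return 1.0 - (1.0 - sample_rate) ** failures


if __name__ == "__main__":
    daily_conversations = 5_000  # assumed volume, for illustration only
    failure_rate = 0.001         # assumed: 1 in 1,000 conversations goes wrong
    failures = round(daily_conversations * failure_rate)

    for rate in (0.02, 0.05, 1.0):
        p = chance_of_catching(failures, rate)
        print(f"sample {rate:.0%}: {p:.1%} chance of seeing the failure at all")
```

Under these assumptions, a 2% sample has roughly a one-in-ten chance of surfacing the problem on a given day, while full coverage sees it every time.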
More crucially, companies measure what's easy to measure. Response time. Volume deflected. Cost per interaction. These metrics look excellent on a spreadsheet. They don't measure what actually matters: whether the customer's problem got solved, whether the experience felt human, whether someone walked away satisfied or frustrated.
Klarna learned this in public.
What Klarna's Numbers Hid
In January 2024, Klarna deployed an OpenAI-powered chatbot across 23 markets. The move replaced 700 human agents with an AI assistant. The metrics were flawless.
In the first month alone, the chatbot handled 2.3 million conversations. Response times dropped from 11 minutes to 2 minutes, an 82% improvement. Repeat inquiries fell 25%. Customer satisfaction scores matched human agents, or so the data said.
Klarna projected a $40 million profit improvement for 2024. The company called itself "OpenAI's favorite guinea pig." The tech industry applauded.
By mid-2025, something had shifted. Customer satisfaction dropped 22%. Service quality became inconsistent. Complaints accumulated: robotic responses, unresolved issues, frustration. Klarna began quietly rehiring human agents.
CEO Sebastian Siemiatkowski later reflected on the strategy: "We focused too much on efficiency and cost. The result was lower quality, and that's not sustainable."
The AI system was efficient. It wasn't effective. Klarna measured deflection and speed (the easy metrics), not satisfaction and actual resolution. When the company finally had visibility into what was actually happening, the picture inverted entirely.
The lesson seems simple. What you measure is what you improve. What remains invisible gets worse. But here's where it gets complicated.
Measurement's Harder Problem
Full visibility into customer interactions solves one problem. It doesn't solve the next one. Measuring 100% of interactions means nothing without acting on what you find. Many companies will measure everything and improve nothing.
This gap between measurement and improvement is real. Research shows that 95% of call centers use quality assurance, yet very few managers report improved customer satisfaction from QA practices. More striking: 83% of agents believe their QA program doesn't help them improve.
Why does this gap exist? Most automated solutions prioritize quantity over quality, measuring superficial compliance rather than driving meaningful improvement. A single quality score doesn't teach an agent how to improve. When measurement becomes the target (when Quality Score becomes what matters instead of customer satisfaction), teams optimize for the metric instead of the outcome.
Automated systems can't always distinguish between an agent making a judgment call and breaking a rule. They can't understand context the way humans do. "You can only scope your requirements if you know what you're talking about and you only know what you're talking about when you've done the manual work," explains Dorien de Vreede, Head of Customer Support at bloomon.
What's actually required, Giblin says, is measuring 100% of interactions while tying every signal back to its cause. "A knowledge gap, a broken workflow, an AI procedure that needs tuning, a coaching moment. Teams can fix it, not just observe it," he says. That requires human judgment.
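One way to picture that loop is the sketch below. It is not Zendesk's implementation: the record type, the 0-100 score scale, the threshold, and the cause labels (borrowed from Giblin's four examples) are all assumptions made for illustration. Low-scoring conversations are grouped by the cause a human reviewer assigned, and anything without a cause goes back to a person.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ScoredConversation:
    conversation_id: str
    score: float            # hypothetical 0-100 quality score
    cause: Optional[str]    # root cause set by a human reviewer, if any


# Hypothetical mapping from cause to the fix a team would own.
CAUSE_TO_ACTION = {
    "knowledge_gap": "update the knowledge base",
    "broken_workflow": "repair the workflow",
    "ai_procedure": "tune the AI procedure",
    "coaching": "schedule a coaching session",
}


def fix_queue(
    conversations: List[ScoredConversation], threshold: float = 70.0
) -> List[Tuple[str, int, str]]:
    """Group low-scoring conversations by cause, most frequent first."""
    low = [c for c in conversations if c.score < threshold]
    # Conversations with no assigned cause still need human judgment.
    counts = Counter(c.cause or "needs_review" for c in low)
    return [
        (cause, n, CAUSE_TO_ACTION.get(cause, "send to human review"))
        for cause, n in counts.most_common()
    ]


if __name__ == "__main__":
    sample = [
        ScoredConversation("c1", 40, "knowledge_gap"),
        ScoredConversation("c2", 55, "knowledge_gap"),
        ScoredConversation("c3", 90, None),
        ScoredConversation("c4", 60, "coaching"),
        ScoredConversation("c5", 30, None),
    ]
    for cause, count, action in fix_queue(sample):
        print(f"{cause}: {count} conversation(s) -> {action}")
```

The score only decides which conversations get attention. The cause decides what actually gets fixed, which is why the unlabeled cases are routed to a reviewer rather than guessed at.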
Trust in AI companies has eroded. In 2019, 62% of people globally trusted AI companies. By 2024, that had fallen to 54%. The EU AI Act now requires ongoing monitoring of high-risk AI systems.
Why did measurement take thirty years to solve? Giblin points to the reason: "For thirty years measurement was optimised for cost-to-serve, and small sample QA was good enough when humans handled the work." AI changes the volume and the failure modes. The old measurement model can't keep up.
Measurement is no longer optional. But compliance and improvement aren't the same thing. Companies can measure everything and still disappoint customers if they're measuring the wrong things or treating measurement as performance rather than insight.
At the same time, enterprises are deploying AI agents at scale without the infrastructure to measure what those agents are actually doing. They are repeating Klarna's approach: prioritizing efficiency metrics over resolution. The gap between "metrics say this is working" and "customers know it isn't" widens every quarter.
The question now is whether companies can measure, act, and improve simultaneously. Zendesk's premise is sound: you can't fix what you don't see. The real work is deciding what to fix when visibility arrives.
The companies that succeed treat measurement as a starting point, not an ending point. The ones that measure without acting, or measure without understanding context, will discover what Klarna learned: metrics can lie.
Key Takeaways
- Recognize that AI customer service data is vast but challenging to interpret effectively.
- Understand that customer satisfaction metrics like Net Promoter Score are declining despite quality assurance programs.
- Implement continuous feedback loops to improve AI customer service outcomes beyond mere measurement.
- Utilize Zendesk's new Quality Score feature to automatically assess all customer interactions.
- Acknowledge the limitations of traditional quality assurance methods in the age of AI-driven support.