
Measuring What Matters: Performance Metrics for Voice AI Agents in Healthcare

By: Zach O’Bea, Principal Product Manager, Artera

The rise of agentic AI is super exciting, but let’s be honest – it can also feel pretty overwhelming. This is uncharted territory for all of us. Large Language Models (LLMs) often seem like a “black box,” and the array of techniques available to monitor them only makes things more complicated.

As Principal Product Manager at Artera, my goal is to make this whole process less intimidating and more transparent. This is why I want to illustrate how Artera evaluates the performance of these non-deterministic, agentic systems, and what metrics we look for when measuring success.

By sharing what’s happening behind the scenes, we hope to inspire curiosity and encourage our customers to ask the tough questions. After all, the real “winners” in the AI space will be the ones who learn to be savvy, informed users of this technology.

Being a savvy user starts with understanding performance. Unlike traditional software, where success is black-and-white, agentic AI lives in the gray areas of human conversation. This requires a framework that moves beyond technical uptime to measure the quality and empathy of the patient experience.

Let’s break down the metrics we focus on below.

Speed Matters: Measuring Latency in Voice AI Healthcare Agents

In voice-based AI, silence creates friction. Just a few seconds of dead air can cause patients to hang up or lose trust. That’s why speed is such a big deal for us.

1. Time to First Token (TTFT)

This measures how quickly the AI responds after a patient says “Hello.” Think of it as the AI’s “hello back” moment. Similar to typing into a blank ChatGPT window, the first response tends to be the slowest due to a cold start – when the model loads its instructions and processes the initial request before replying. At Artera, we strive to keep this first-response latency to no more than 500 milliseconds (ms), as anything longer can feel awkward or even lead to patient frustration.
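For the technically curious, here’s a minimal sketch of how TTFT can be measured against that 500 ms budget, assuming a token-streaming interface. The names here (measure_ttft_ms, fake_model_stream) are illustrative, not our production code:

```python
import time
from typing import Iterable, Iterator

TTFT_BUDGET_MS = 500  # the 500 ms target discussed above

def measure_ttft_ms(token_stream: Iterable[str]) -> tuple[float, Iterator[str]]:
    """Return time-to-first-token in milliseconds, plus an iterator over the full reply."""
    start = time.monotonic()
    stream = iter(token_stream)
    first_token = next(stream)  # blocks until the model emits its first token
    ttft_ms = (time.monotonic() - start) * 1000.0

    def remaining() -> Iterator[str]:
        yield first_token
        yield from stream

    return ttft_ms, remaining()

# Simulated model stream: a short cold-start delay before the first token arrives.
def fake_model_stream():
    time.sleep(0.3)
    yield "Hello"
    yield ", thanks for calling."

ttft, tokens = measure_ttft_ms(fake_model_stream())
print(f"TTFT: {ttft:.0f} ms ({'within' if ttft <= TTFT_BUDGET_MS else 'over'} budget)")
```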

2. Average Turn Latency

Once the conversation is rolling, we look at “turn latency.” This measures how long it takes for the AI to reply after the patient finishes speaking. If this suddenly gets slower, it usually means there’s a system issue, like a sluggish API or a tricky query. We track these hiccups closely and get real-time alerts so we can jump in and potentially fix things quickly, keeping conversations smooth and frustration-free.
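As a rough illustration, turn latency monitoring boils down to averaging the per-turn gaps in a call (or time window) and alerting when that average drifts above a threshold. The numbers below are made up for the example, not our actual alerting configuration:

```python
import statistics

TURN_LATENCY_ALERT_MS = 1500  # illustrative alert threshold, not our real setting

def average_turn_latency_ms(turn_latencies_ms: list[float]) -> float:
    """Average gap (ms) between the end of patient speech and the agent's reply."""
    return statistics.mean(turn_latencies_ms)

def latency_regression(recent_ms: list[float], baseline_ms: float) -> bool:
    """Flag a window whose average latency exceeds both the baseline and the alert cap."""
    avg = average_turn_latency_ms(recent_ms)
    return avg > max(TURN_LATENCY_ALERT_MS, 1.5 * baseline_ms)

latencies = [620, 710, 2400, 690]                      # one call's per-turn latencies (ms)
print(f"{average_turn_latency_ms(latencies):.0f} ms")  # 1105 ms
print(latency_regression(latencies, baseline_ms=700))  # False; True would fire an alert
```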

Quality of Life (QoL) Metrics: AI Workflow Reliability & Task Success Metrics

Speed is important, but it means little if the AI fails to get the job done. This is where “quality of life” metrics become crucial, as they’re specifically designed to measure success within the context of a particular workflow. By asking questions like, “Is the AI completing tasks accurately, or is it creating errors?” we can gather the data needed to fine-tune our systems. 

3. Patient Identification Success Rate

Let’s look at a scheduling workflow, which is one of our most in-demand use cases. To get a patient on the books, the AI must know exactly who it’s talking to. It’ll ask for a name and date of birth, then double-check those details against the EMR. We track how often the AI nails this step – basically its “win rate.” When this goes smoothly, the rest of the call ideally falls into place. But if that success rate starts to dip? That’s a red flag that something in the initial verification step needs to be investigated. 
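At its simplest, this “win rate” is just the share of verification attempts where both identifiers matched the chart on file. Here’s a small sketch with hypothetical data shapes, not our actual schema:

```python
from dataclasses import dataclass

@dataclass
class VerificationAttempt:
    call_id: str
    name_matched: bool
    dob_matched: bool

def patient_id_success_rate(attempts: list[VerificationAttempt]) -> float:
    """Share of calls where both name and date of birth matched the record on file."""
    if not attempts:
        return 0.0
    wins = sum(1 for a in attempts if a.name_matched and a.dob_matched)
    return wins / len(attempts)

attempts = [
    VerificationAttempt("call-1", name_matched=True, dob_matched=True),
    VerificationAttempt("call-2", name_matched=True, dob_matched=False),  # DOB mismatch
    VerificationAttempt("call-3", name_matched=True, dob_matched=True),
]
print(f"{patient_id_success_rate(attempts):.0%}")  # 67%
```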

4. Tool Success Rate

We also keep a close eye on the “health” of the dozens of tools baked into each workflow. By monitoring every success and failure, we can jump on issues the moment they happen, like if an agent suddenly can’t book an appointment. If things go sideways, we know immediately that an investigation is required to determine whether it’s an EHR outage, a weird formatting glitch, or some other technical hiccup.
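Conceptually, this is a running tally of successes and failures per tool, with an alert when a failure rate climbs too high. A minimal sketch, with an illustrative threshold and tool names rather than our real configuration:

```python
from collections import defaultdict

class ToolHealthTracker:
    """Tallies successes and failures per tool so sudden drops can trigger an alert."""

    def __init__(self, failure_alert_threshold: float = 0.10):
        self.counts = defaultdict(lambda: {"success": 0, "failure": 0})
        self.failure_alert_threshold = failure_alert_threshold

    def record(self, tool_name: str, succeeded: bool) -> None:
        self.counts[tool_name]["success" if succeeded else "failure"] += 1

    def failure_rate(self, tool_name: str) -> float:
        c = self.counts[tool_name]
        total = c["success"] + c["failure"]
        return c["failure"] / total if total else 0.0

    def tools_needing_investigation(self) -> list[str]:
        return [t for t in self.counts if self.failure_rate(t) > self.failure_alert_threshold]

tracker = ToolHealthTracker()
tracker.record("book_appointment", succeeded=True)
tracker.record("book_appointment", succeeded=False)   # e.g. an EHR timeout
tracker.record("verify_patient", succeeded=True)
print(tracker.tools_needing_investigation())          # ['book_appointment']
```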

Patient Experience Metrics

Even when the AI is fast and our tools are working correctly, we still want a more qualitative view of what a patient experiences while they’re on a call with our agent. Measuring this is tricky but essential, which is why our “LLM as a judge” evaluation includes a metric dedicated to patient experience.

5. Patient Experience Score

To produce this score, we use an internal agent to review patient conversations and rate each one from 1 to 3:

  • 3 (Excellent): The interaction with a patient was smooth, easy, and completed without issues. 
  • 2 (Good): The task was done, but there were minor hiccups, like asking the patient to repeat themselves.
  • 1 (Poor): The call went off the rails—confusion, frustration, or the need to escalate to a human.

Our goal? Maximize “3” scores and drive “1” scores to zero. This gives us actionable insights into how patients feel about their interactions and helps us fine-tune the AI.
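For readers curious what an “LLM as a judge” can look like in practice, here’s a stripped-down sketch of the 1-to-3 scoring idea. The prompt wording and the call_judge_model stub are placeholders, not our production judge:

```python
RUBRIC = (
    "Score this call transcript for patient experience.\n"
    "3 = smooth, easy, completed without issues\n"
    "2 = task completed, but with minor hiccups (e.g. the patient had to repeat themselves)\n"
    "1 = confusion, frustration, or escalation to a human was needed\n"
    "Reply with a single digit: 1, 2, or 3."
)

def call_judge_model(prompt: str) -> str:
    # Stand-in for a real LLM call; wire up an actual model client here.
    return "3"

def score_patient_experience(transcript: str) -> int:
    """Apply the rubric to one transcript and return a 1-3 score."""
    reply = call_judge_model(f"{RUBRIC}\n\nTranscript:\n{transcript}").strip()
    score = int(reply[0])
    if score not in (1, 2, 3):
        raise ValueError(f"Judge returned an out-of-rubric score: {reply!r}")
    return score

print(score_patient_experience("Agent: Hi! ... Patient: Great, thanks."))  # 3
```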

Other Metrics We’re Tracking (and Iterating On) 

  • Average Length of Calls (in minutes): Sometimes a long call is legitimate; sometimes it’s a sign of an underlying issue.
  • Count of Agent Conversations (number of inbound/outbound conversations): Important for customers to know how many calls their agent is handling each day.
  • Call Outcomes: Determines whether a session was “successful” – this is currently defined as those where the workflow was completed autonomously without human intervention. We’re continuing to refine this, as there are nuances to certain handoffs.
  • Workflow Adherence: How closely an agent follows a given set of instructions. Using an LLM as judge, this metric determines whether the agent deviated from the specified workflow and to what extent.
  • Handoff Reason Analysis: A detailed examination of why agents escalate interactions to human representatives, identifying the specific factors that necessitate a handoff (a small sketch combining this with Call Outcomes follows this list).
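To show how two of these roll up, here’s a small sketch of computing an autonomous-completion rate and a handoff-reason breakdown from per-call records; the field names and reasons are hypothetical:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallRecord:
    call_id: str
    workflow_completed: bool
    handoff_reason: Optional[str] = None  # None when no human handoff occurred

def call_outcome_rate(calls: list[CallRecord]) -> float:
    """Share of sessions completed autonomously, with no human intervention."""
    successes = sum(1 for c in calls if c.workflow_completed and c.handoff_reason is None)
    return successes / len(calls) if calls else 0.0

def handoff_reason_breakdown(calls: list[CallRecord]) -> Counter:
    """Tally of why calls were escalated to a human representative."""
    return Counter(c.handoff_reason for c in calls if c.handoff_reason)

calls = [
    CallRecord("c1", workflow_completed=True),
    CallRecord("c2", workflow_completed=False, handoff_reason="patient requested a human"),
    CallRecord("c3", workflow_completed=False, handoff_reason="identity could not be verified"),
]
print(f"{call_outcome_rate(calls):.0%}")          # 33%
print(handoff_reason_breakdown(calls).most_common())
```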

In the future, we plan to enhance our analytics capabilities by introducing session sentiment analysis and by allowing different agent use cases to have their own workflow-specific rubrics and metrics. Soon, our customers will be able to review both audio and text files from conversations to identify tone of voice and patient inflection, unlocking a new level of conversational intelligence.

Turning Metrics Into Action: How AI Performance Data Drives Better Healthcare Outcomes

Evaluating these metrics is the key to continuous improvement. When something goes wrong, they help us spot and fix it fast.

By analyzing agent-patient conversations with our LLM as judge, we gain both quantitative and qualitative insights into our agents’ performance. This allows us to pinpoint specific areas for improvement. Manually reviewing every transcript is impractical at scale, so these techniques are essential for efficiently detecting agent hallucinations and identifying trends in technical failures or inconsistencies. Most importantly, this process provides our customers with the confidence that our agents are performing as promised.

If you’re a healthcare leader thinking about agentic AI, here’s my advice: don’t stop at the demo. Ask for the data. Look into how performance is tracked and monitored. With the right metrics, you can ensure that your AI agents truly deliver and transform the patient experience.
