
Voice Evaluation — Test & Score AI Voice Agents with Automated Judges

Langoedge Team · 5 min read

What is Voice Evaluation?

Langoedge provides a professional suite for testing and scoring your AI voice agents using automated quality judges. This system eliminates the need for manual call listening by analyzing real-world conversation transcripts at scale.

Definition — Voice Evaluation: An automated process where LLM-powered "judges" review voice call transcripts against your defined quality criteria — measuring empathy, accuracy, compliance, and task completion.

The Core Pillars

- **Post-Call Analysis:** Every call is recorded and stored for later review.
- **LLM-as-Judge:** Automated judges score transcripts against your custom criteria.
- **Observability:** Track **Pass Rate** and **Quality Scores** over time across different agent versions.


How to Run an Evaluation

Step 1: Define Your Evaluators (Judges)

An evaluator is an LLM instruction set that scores a call transcript.

Go to the Evals tab in the main header of your Voice Graph and select the Evaluators tab.

Each evaluator has three components:

| Setting | Description | Example |
| --- | --- | --- |
| Metric Type | Category for the judge's focus area | "Empathy", "Task Completion", "Compliance" |
| System Instructions | Instructions for the judge. (Note: Do not include `{transcript}` placeholders; the history is appended automatically) | "Verify if the agent mentioned the user's name." |
| Threshold | Minimum score (0.0 to 1.0) for a "Pass" | 0.7 means score must be 70%+ |
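
The three settings above can be modelled as a small record. The following is a minimal sketch, not a Langoedge API: the `Evaluator` class and its field names are hypothetical, chosen to mirror the table, and the validation simply enforces the two rules the table states (threshold range, no `{transcript}` placeholder).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluator:
    """Hypothetical in-memory model of one judge (not a Langoedge type)."""
    metric_type: str          # e.g. "Empathy", "Task Completion", "Compliance"
    system_instructions: str  # judge prompt, without a transcript placeholder
    threshold: float          # minimum passing score, 0.0 to 1.0

    def __post_init__(self) -> None:
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        if "{transcript}" in self.system_instructions:
            # The guide states that the history is appended automatically.
            raise ValueError("do not include a {transcript} placeholder")


name_check = Evaluator(
    metric_type="Compliance",
    system_instructions="Verify if the agent mentioned the user's name.",
    threshold=0.7,
)
```

Validating at construction time means a misconfigured judge fails before any call is scored rather than partway through a run.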

Example Prompt:

"Score 90 if the agent successfully booked the appointment, 50 if they tried but failed, and 10 if they were rude or unhelpful."

Step 2: Select a Session from Evals History

> [!WARNING]
> Deprecated: Scenarios & Synthetic Simulations. The ability to run synthetic simulated conversations as evaluation inputs has been removed. Evaluations now operate exclusively on real call sessions captured in production.

To evaluate your agent:

  1. Go to the Evals tab in the main header and select the secondary Evals tab.
  2. Browse the chronological list of all calls handled by the agent in the sidebar.
  3. Select a session that is interesting, representative, or contains edge cases.

Step 3: Launch Evaluation

Click Run Intelligence Suite in the top right corner of the selected session view.

The system will:

  1. Fetch the recorded Chat History (including the system context that was active during the call).
  2. Trigger your selected judges to score the conversation.
  3. Display results in the details pane with scores, pass/fail status, and reasoning.
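
The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's implementation: `build_judge_prompt`, `run_suite`, and `stub_judge` are hypothetical names, the judge is a stand-in function rather than a real LLM call, and scores are assumed to be on a 0 to 100 scale that is divided by 100 before comparison with the threshold.

```python
from typing import Callable

Turn = tuple[str, str]                        # (speaker, text)
JudgeFn = Callable[[str], tuple[float, str]]  # prompt -> (score 0-100, reasoning)


def build_judge_prompt(system_instructions: str, history: list[Turn]) -> str:
    """Append the recorded history to the end of the judge instructions."""
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return f"{system_instructions}\n\nTranscript:\n{transcript}"


def run_suite(history: list[Turn],
              judges: dict[str, tuple[str, float]],
              ask_judge: JudgeFn) -> list[dict]:
    """judges maps a judge name to (system_instructions, threshold)."""
    results = []
    for name, (instructions, threshold) in judges.items():
        score, reasoning = ask_judge(build_judge_prompt(instructions, history))
        results.append({
            "judge": name,
            "score": score,
            "passed": score / 100 >= threshold,  # assumes 0-100 scores
            "reasoning": reasoning,
        })
    return results


def stub_judge(prompt: str) -> tuple[float, str]:
    """Stand-in for a real LLM call so the sketch runs offline."""
    if "confirmed" in prompt.lower():
        return 90.0, "The agent confirmed the booking."
    return 50.0, "No confirmation found in the transcript."


history = [
    ("user", "I need an appointment on Friday."),
    ("agent", "Your appointment is confirmed for Friday at 10am."),
]
judges = {
    "Task Completion": (
        "Score 90 if the agent booked the appointment, 50 otherwise.", 0.7),
}
results = run_suite(history, judges, stub_judge)
```

Passing the judge in as a function keeps the scoring loop independent of any particular LLM provider, so the same loop can be exercised with a stub in tests.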

Interpreting Results

After an evaluation completes, you'll see:

| Metric | What It Means |
| --- | --- |
| Overall Score | The mean score across all active judges for that call. |
| Pass Rate | The percentage of judges whose scores met their individual thresholds. |
| Score Breakdown | Each judge provides a numeric score plus free-text reasoning explaining why that score was given. |

Key Takeaway: The free-text reasoning is invaluable. It tells you not just that an agent failed, but why — enabling targeted prompt improvements rather than random trial and error.
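
To make the Overall Score and Pass Rate definitions concrete, here is a small worked sketch. The `summarize` function is hypothetical, and it assumes judge scores are on a 0 to 100 scale that is divided by 100 before being compared with each judge's 0.0 to 1.0 threshold; the platform's exact normalization is not documented here.

```python
from statistics import mean


def summarize(results: list[tuple[float, float]]) -> dict[str, float]:
    """results holds one (score on 0-100, threshold on 0.0-1.0) pair per judge."""
    scores = [score for score, _ in results]
    passed = sum(1 for score, threshold in results if score / 100 >= threshold)
    return {
        "overall_score": mean(scores),              # mean across active judges
        "pass_rate": 100 * passed / len(results),   # percent of judges passing
    }


# Three judges: two meet their thresholds, one does not.
summary = summarize([(90, 0.7), (50, 0.7), (100, 1.0)])
```

In this example the overall score is 80 while only two of the three judges pass, which shows why the two metrics are worth reading together.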


Best Practices for Effective Testing

Monitor Edge Cases

Search call logs for sessions where users expressed frustration, confusion, or disconnection. These are the scenarios where your agent is most likely to fail.
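
One simple way to shortlist such sessions is a keyword scan over user turns. The sketch below assumes you have exported sessions as a mapping from session ID to user utterances; the cue phrases and the `flag_sessions` helper are illustrative assumptions, not a built-in search feature.

```python
# Hypothetical cue phrases; tune these to your own callers' vocabulary.
FRUSTRATION_CUES = (
    "frustrated",
    "confused",
    "not what i asked",
    "speak to a human",
)


def flag_sessions(sessions: dict[str, list[str]]) -> list[str]:
    """Return IDs of sessions where any user utterance contains a cue phrase."""
    flagged = []
    for session_id, user_turns in sessions.items():
        text = " ".join(user_turns).lower()
        if any(cue in text for cue in FRUSTRATION_CUES):
            flagged.append(session_id)
    return flagged


sample = {
    "call-001": ["Hi, I want to book a cleaning."],
    "call-002": ["That is not what I asked.", "Can I speak to a human?"],
}
flagged = flag_sessions(sample)
```

A keyword scan will miss frustration expressed in other words, so treat it as a first-pass filter before selecting sessions to evaluate.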

Iterate on Judge Prompts

If a judge is too lenient, add specific "negative examples" to its configuration prompt. If too strict, add positive examples.

Track Over Time

Run evaluations regularly on new calls. Plot Overall Score and Pass Rate over time to see if agent improvements are actually working.
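
If you export evaluation results, a weekly roll-up is enough to see a trend. The following sketch assumes each run is a `(call date, overall score, pass rate)` tuple; the `weekly_trend` helper and that data shape are hypothetical, not an export format the platform defines.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

Run = tuple[date, float, float]  # (call date, overall score, pass rate)


def weekly_trend(runs: list[Run]) -> dict[tuple[int, int], tuple[float, float]]:
    """Average overall score and pass rate per ISO (year, week)."""
    buckets: dict[tuple[int, int], list[tuple[float, float]]] = defaultdict(list)
    for day, score, pass_rate in runs:
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append((score, pass_rate))
    return {
        week: (mean([s for s, _ in vals]), mean([p for _, p in vals]))
        for week, vals in sorted(buckets.items())
    }


runs = [
    (date(2024, 1, 1), 70, 50),
    (date(2024, 1, 3), 80, 100),
    (date(2024, 1, 8), 90, 100),
]
trend = weekly_trend(runs)
```

Grouping by ISO week keeps the buckets aligned with the weekly cadence suggested in the FAQ, and the resulting dictionary can be passed to any plotting tool.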

Use Multiple Judges

Create separate judges for different concerns: one for empathy, one for task completion, one for compliance. This gives you multidimensional visibility.
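
With one judge per concern, you can compare dimensions directly. This sketch assumes scores on a 0 to 100 scale collected per metric type across several calls; `weakest_dimension` is a hypothetical helper for illustration.

```python
def weakest_dimension(scores_by_metric: dict[str, list[float]]) -> tuple[str, float]:
    """Return the metric type with the lowest average score and that average."""
    averages = {
        metric: sum(values) / len(values)
        for metric, values in scores_by_metric.items()
        if values  # skip metrics that have no scores yet
    }
    metric = min(averages, key=averages.get)
    return metric, averages[metric]


weakest = weakest_dimension({
    "Empathy": [80, 90],
    "Task Completion": [50, 70],
    "Compliance": [100, 100],
})
```

Here the agent is polite and compliant but struggles to finish tasks, a pattern that a single blended score would hide.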


The following evaluator prompts can be adapted to your use case. Each one should instruct the judge to score on a scale of 0 to 100:

Task Completion Judge

"Did the agent successfully complete the user's primary request? Score 100 if fully completed, 50 if partially completed, 10 if not attempted or failed."

Empathy & Tone Judge

"Was the agent empathetic, patient, and professional throughout? Score based on language used, acknowledgment of user emotions, and tone consistency. Score 100 for excellent empathy, 50 for neutral, 10 for rude or dismissive."

Compliance Judge

"Did the agent make any false promises, share unverified information, or violate compliance rules? Score 100 for full compliance, 0 for any violation detected."

Handoff Quality Judge

"When the agent transferred the call to a human or another department, was the handoff smooth? Did the agent provide context to the recipient? Score from 0 to 100 based on transition quality."


Cost & Performance

Cost Note: Evaluations only use LLM tokens for the final "Judge" assessment — not for simulating conversations. This makes evaluations highly cost-effective compared to synthetic testing approaches, as they leverage real-world data already captured during production calls.

Each evaluation runs independently and does not affect your voice agent's performance or call handling.


Frequently Asked Questions

How much do evaluations cost?
Evaluations only consume LLM tokens for the judge assessment step, not for running simulated conversations. A single evaluation of a typical 5-minute call transcript costs only a few cents in LLM tokens.

Can I create custom evaluation criteria?
Yes. Each evaluator is defined by a prompt template where you control the scoring criteria. You can create unlimited evaluators for any metric, such as empathy, accuracy, compliance, or brand voice.

Do I need to include a transcript placeholder in my prompt?
No. The evaluation engine automatically appends the conversation transcript to the end of your evaluator prompt at runtime. You do not need to include any `{transcript}` placeholders in your custom prompt.

How often should I run evaluations?
We recommend running evaluations weekly, or after any significant change to your voice agent's prompts, model selection, or node configuration.
