Voice Evaluation — Test & Score AI Voice Agents with Automated Judges
What is Voice Evaluation?
Langoedge provides a suite for testing and scoring your AI voice agents with automated quality judges. It analyzes real call transcripts at scale, so you no longer need to listen to calls manually to assess quality.
Definition — Voice Evaluation: An automated process where LLM-powered "judges" review voice call transcripts against your defined quality criteria — measuring empathy, accuracy, compliance, and task completion.
The Core Pillars
- **Post-Call Analysis:** Every call is recorded and stored for later review.
- **LLM-as-Judge:** Automated judges score transcripts against your custom criteria.
- **Observability:** Track **Pass Rate** and **Quality Scores** over time across different agent versions.
How to Run an Evaluation
Step 1: Define Your Evaluators (Judges)
An evaluator is an LLM instruction set that scores a call transcript.
Go to the Evals tab in the main header of your Voice Graph and select the Evaluators tab.
Each evaluator has three components:
| Setting | Description | Example |
|---|---|---|
| Metric Type | Category for the judge's focus area | "Empathy", "Task Completion", "Compliance" |
| System Instructions | Instructions for the judge. (Note: Do not include {transcript} placeholders; the history is appended automatically) | "Verify if the agent mentioned the user's name." |
| Threshold | Minimum score (0.0 to 1.0) for a "Pass" | 0.7 means the score must be 70% or higher |
Example Prompt:
"Score 90 if the agent successfully booked the appointment, 50 if they tried but failed, and 10 if they were rude or unhelpful."
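To make the three settings concrete, here is a minimal Python sketch of an evaluator and its pass check. It is not the Langoedge API: the `Evaluator` class, its field names, and the `passes` method are hypothetical. It also assumes that the judge returns a score on a 0 to 100 scale and that the score is divided by 100 before being compared with the threshold, which is one reading of "0.7 means 70%".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluator:
    """Hypothetical in-memory mirror of the three evaluator settings."""

    metric_type: str
    system_instructions: str
    threshold: float  # minimum normalized score (0.0 to 1.0) for a "Pass"

    def __post_init__(self) -> None:
        # The threshold is documented as a value between 0.0 and 1.0.
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        # The call history is appended automatically, so no placeholder is needed.
        if "{transcript}" in self.system_instructions:
            raise ValueError("do not include a {transcript} placeholder")

    def passes(self, raw_score: float) -> bool:
        """Assumes a 0-100 judge score; normalize to 0-1, then compare."""
        return raw_score / 100.0 >= self.threshold


booking_judge = Evaluator(
    metric_type="Task Completion",
    system_instructions=(
        "Score 90 if the agent successfully booked the appointment, "
        "50 if they tried but failed, and 10 if they were rude or unhelpful."
    ),
    threshold=0.7,
)
```

Under these assumptions, a score of 90 passes a 0.7 threshold and a score of 50 does not.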
Step 2: Select a Session from Evals History
[!WARNING]
Deprecated: Scenarios & Synthetic Simulations. The ability to run synthetic simulated conversations as evaluation inputs has been removed. Evaluations now operate exclusively on real call sessions captured in production.
To evaluate your agent:
- Go to the Evals tab in the main header and select the secondary Evals tab.
- Browse the chronological list of all calls handled by the agent in the sidebar.
- Select a session that is interesting, representative, or contains edge cases.
Step 3: Launch Evaluation
Click Run Intelligence Suite in the top right corner of the selected session view.
The system will:
- Fetch the recorded Chat History (including the system context that was active during the call).
- Trigger your selected judges to score the conversation.
- Display results in the details pane with scores, pass/fail status, and reasoning.
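The flow above can be pictured roughly as the following Python sketch. Everything here is illustrative rather than Langoedge's implementation: `run_suite`, `JudgeResult`, the dictionary keys, and the `judge` callable (a stand-in for the LLM call) are hypothetical names, and the way the transcript is appended to the instructions is an assumption.

```python
from typing import Callable, NamedTuple


class JudgeResult(NamedTuple):
    metric_type: str
    score: float      # 0-100, as returned by the judge (assumed scale)
    passed: bool
    reasoning: str


def run_suite(
    chat_history: list[dict],
    evaluators: list[dict],
    judge: Callable[[str], tuple[float, str]],
) -> list[JudgeResult]:
    """Score one recorded session with every selected evaluator."""
    # Step 1: render the recorded history, including the system context.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in chat_history)

    results = []
    for ev in evaluators:
        # Step 2: the history is appended to the judge instructions.
        prompt = f"{ev['system_instructions']}\n\n{transcript}"
        score, reasoning = judge(prompt)
        passed = score / 100.0 >= ev["threshold"]
        # Step 3: collect score, pass/fail status, and reasoning for display.
        results.append(JudgeResult(ev["metric_type"], score, passed, reasoning))
    return results
```

Passing the judge in as a callable keeps the sketch testable with a fake judge, without calling a real model.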
Interpreting Results
After an evaluation completes, you'll see:
| Metric | What It Means |
|---|---|
| Overall Score | The mean score across all active judges for that call. |
| Pass Rate | The percentage of judges whose scores met their individual thresholds. |
| Score Breakdown | Each judge provides a numeric score plus free-text reasoning explaining why that score was given. |
Key Takeaway: The free-text reasoning is invaluable. It tells you not just that an agent failed, but why — enabling targeted prompt improvements rather than random trial and error.
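As a worked example of the two aggregate metrics, the Python sketch below computes an Overall Score (the mean across judges) and a Pass Rate (the share of judges that met their own threshold) for a single call. The `summarize` function and the sample numbers are hypothetical, and the sketch assumes 0 to 100 judge scores that are normalized to 0 to 1 before the threshold comparison.

```python
from statistics import mean


def summarize(results: list[tuple[float, float]]) -> tuple[float, float]:
    """Each item is (judge score on 0-100, that judge's threshold on 0-1)."""
    if not results:
        raise ValueError("at least one judge result is required")
    # Overall Score: mean score across all active judges.
    overall_score = mean(score for score, _ in results)
    # Pass Rate: percentage of judges whose score met their own threshold.
    passed = sum(1 for score, threshold in results if score / 100.0 >= threshold)
    pass_rate = 100.0 * passed / len(results)
    return overall_score, pass_rate


# Hypothetical call scored by three judges.
overall, rate = summarize([(90.0, 0.7), (50.0, 0.7), (100.0, 1.0)])
print(f"Overall Score: {overall:.1f}, Pass Rate: {rate:.1f}%")
```

In this example two of the three judges pass, so a high Overall Score can still coexist with a Pass Rate below 100%. That gap is why the per-judge breakdown is worth reading.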
Best Practices for Effective Testing
Monitor Edge Cases
Search call logs for sessions where users expressed frustration or confusion, or where the call was disconnected. These are the scenarios where your agent is most likely to fail.
Iterate on Judge Prompts
If a judge is too lenient, add specific "negative examples" to its configuration prompt. If too strict, add positive examples.
Track Over Time
Run evaluations regularly on new calls. Plot Overall Score and Pass Rate over time to see if agent improvements are actually working.
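One way to compare results across agent versions, assuming you keep your own record of each evaluation run, is sketched below in Python. The record shape (`agent_version`, `overall_score`, `pass_rate`) and the `trend_by_version` helper are hypothetical and not a Langoedge export format.

```python
from collections import defaultdict
from statistics import mean


def trend_by_version(runs: list[dict]) -> dict[str, dict[str, float]]:
    """Average Overall Score and Pass Rate for each agent version."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for run in runs:
        grouped[run["agent_version"]].append(run)
    return {
        version: {
            "overall_score": mean(r["overall_score"] for r in items),
            "pass_rate": mean(r["pass_rate"] for r in items),
        }
        for version, items in grouped.items()
    }


# Hypothetical evaluation records, one per evaluated call.
runs = [
    {"agent_version": "v1", "overall_score": 60.0, "pass_rate": 50.0},
    {"agent_version": "v1", "overall_score": 70.0, "pass_rate": 100.0},
    {"agent_version": "v2", "overall_score": 85.0, "pass_rate": 100.0},
]
trend = trend_by_version(runs)
```

The per-version averages can then be plotted or tabulated to check whether a prompt change moved the metrics in the intended direction.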
Use Multiple Judges
Create separate judges for different concerns: one for empathy, one for task completion, one for compliance. This gives you multidimensional visibility.
Recommended Evaluator Templates
Here are proven evaluator prompts you can adapt for your use case (note that they should instruct the judge to score on a scale of 0 to 100):
Task Completion Judge
"Did the agent successfully complete the user's primary request? Score 100 if fully completed, 50 if partially completed, 10 if not attempted or failed."
Empathy & Tone Judge
"Was the agent empathetic, patient, and professional throughout? Score based on language used, acknowledgment of user emotions, and tone consistency. Score 100 for excellent empathy, 50 for neutral, 10 for rude or dismissive."
Compliance Judge
"Did the agent make any false promises, share unverified information, or violate compliance rules? Score 100 for full compliance, 0 for any violation detected."
Handoff Quality Judge
"When the agent transferred the call to a human or another department, was the handoff smooth? Did the agent provide context to the recipient? Score from 0 to 100 based on transition quality."
Cost & Performance
Cost Note: Evaluations only use LLM tokens for the final "Judge" assessment — not for simulating conversations. This makes evaluations highly cost-effective compared to synthetic testing approaches, as they leverage real-world data already captured during production calls.
Each evaluation runs independently and does not affect your voice agent's performance or call handling.