Each turn is evaluated on five quantitative metrics using LLM-as-a-judge. Each metric is scored on a 1.0–5.0 scale, interpreted as follows:
| Range | Label | Meaning |
|---|---|---|
| 1.0 – 2.0 | Poor | Response fails to meet basic expectations |
| 2.0 – 3.0 | Needs Improvement | Partially addresses the task but has significant gaps |
| 3.0 – 4.0 | Good | Adequately addresses the task with minor issues |
| 4.0 – 5.0 | Excellent | Fully addresses the task with high quality |
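As a concrete illustration, here is a minimal Python sketch of this bucketing; the function name and the decision to assign boundary scores to the higher band are assumptions, not part of the evaluation harness:

```python
def label_for_score(score: float) -> str:
    """Map a 1.0-5.0 metric score onto the labels above.

    Assumption: boundary scores (e.g. exactly 3.0) fall into the
    higher band, consistent with the failure trigger below, which
    fires only for scores strictly below 3.0.
    """
    if score >= 4.0:
        return "Excellent"
    if score >= 3.0:
        return "Good"
    if score >= 2.0:
        return "Needs Improvement"
    return "Poor"
```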
If any turn-level metric scores below 3.0, a qualitative behaviour-failure evaluation is triggered. The threshold of 3.0 (the lower bound of "Good") was chosen because scores below it indicate significant gaps worth diagnosing. Failure categories include:
Each score is accompanied by LLM-generated reasoning explaining why it was given. For turns with behaviour failures, the detailed failure reason is shown in the conversation modal under "Issue Details".
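In code, the trigger amounts to a single check. A hedged sketch, assuming turn metrics arrive as a mapping from metric name to score (the metric names and constant are illustrative, not the harness's actual identifiers):

```python
FAILURE_THRESHOLD = 3.0

def needs_failure_evaluation(metric_scores: dict[str, float]) -> bool:
    """Return True when any turn-level metric scores below 3.0,
    triggering the qualitative behaviour-failure evaluation."""
    return any(s < FAILURE_THRESHOLD for s in metric_scores.values())

# Example: flagged because "accuracy" falls below 3.0.
needs_failure_evaluation({"helpfulness": 4.2, "accuracy": 2.5, "tone": 3.8})
```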
Each conversation receives a Final Score (0–1) computed as:
final_score = turn_success_ratio × 0.75 + goal_completion × 0.25

- **turn_success_ratio** = (total_turns − behaviour_failures) / total_turns. Measures the fraction of turns without agent behaviour failures. Weighted higher because consistent turn-level quality is the primary indicator of a reliable agent.
- **goal_completion** = 1 if the user's goal was successfully addressed, 0 otherwise. Evaluated by an LLM at conversation end. Weighted lower because an agent can complete the goal while still exhibiting problematic behaviour in individual turns.

Conversation status is determined by the final score:

- **1.0 = Done**: perfect score, no failures
- **≥ 0.6 = Partial Failure**: goal may be met but with some turn-level issues
- **< 0.6 = Failed**: significant quality issues

After all turns are evaluated, the agent behaviour failure reasons are collected and deduplicated across conversations using an LLM. This produces a list of unique errors, each with a root-cause description and the specific conversation turns where it occurred. For automated fix suggestions, see arklex.ai.
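Putting the formula and the status thresholds together, a minimal sketch, assuming per-conversation failure counts and a boolean goal verdict have already been computed (all names are illustrative):

```python
def compute_final_score(total_turns: int, behaviour_failures: int,
                        goal_completed: bool) -> float:
    """Final score in [0, 1]: weighted mix of turn success and goal completion."""
    turn_success_ratio = (total_turns - behaviour_failures) / total_turns
    goal_completion = 1.0 if goal_completed else 0.0
    return turn_success_ratio * 0.75 + goal_completion * 0.25

def conversation_status(score: float) -> str:
    """Map a final score to a conversation status."""
    if score == 1.0:
        return "Done"             # perfect score, no failures
    if score >= 0.6:
        return "Partial Failure"  # goal may be met, some turn-level issues
    return "Failed"               # significant quality issues
```

For example, 2 behaviour failures across 10 turns with the goal met gives 0.8 × 0.75 + 1 × 0.25 = 0.85, a Partial Failure despite goal completion.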
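The deduplication step could look roughly like the sketch below; `call_llm` is a hypothetical stand-in for whatever completion client the harness uses, and the prompt wording is invented for illustration:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical completion call; wire in your actual LLM client."""
    raise NotImplementedError

def deduplicate_failures(failure_reasons: list[dict]) -> list[dict]:
    """Collapse per-turn failure reasons (each a dict with 'chat_id',
    'turn', and 'reason') into unique errors, each carrying a root
    cause and the turns where it occurred."""
    prompt = (
        "Group the following agent behaviour failures into unique errors. "
        "Return a JSON list where each item has 'root_cause' and "
        "'occurrences' (a list of {chat_id, turn} objects):\n"
        + json.dumps(failure_reasons, indent=2)
    )
    return json.loads(call_llm(prompt))
```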
Per-conversation results are summarised in a table with the following columns:

| Chat ID | Scenario ID | Goal Completion Score | Final Score | Status |
|---|---|---|---|---|