{{ selected_task_meta.ref }}
Suite: {{ selected_task_meta.suite }}
{{ selected_task_meta.description or 'No description available.' }}
{% else %}Select a task to see its description.
{% endif %}
No runs logged yet.
{% endif %}
{{ result | tojson(indent=2) }}
{% else %}
Results will appear here after you launch a run.
{% endif %}
{{ entry | tojson(indent=2) }}
Trace is empty for this run.
{% endif %}
Select a trace from Recent Runs or run a task to view its steps.
{% endif %}
One-click launch for known-good agent and task pairings. Equivalent to the command agent-bench run pairing <name>.
| Agent | Task | Success % | Avg Steps | Avg Tool Calls | Seed (latest) | Runs | Latest |
|---|---|---|---|---|---|---|---|
| {{ row.agent.replace('agents/', '') }} | {{ row.task_ref }} | {{ (row.success_rate * 100) | round(1) }} | {% if row.avg_steps is not none %}{{ row.avg_steps | round(1) }}{% else %}—{% endif %} | {% if row.avg_tool_calls is not none %}{{ row.avg_tool_calls | round(1) }}{% else %}—{% endif %} | {% if row.last_seed is not none %}{{ row.last_seed }}{% else %}—{% endif %} | {{ row.runs }} | {% if row.last_run_id %} {% if row.last_success %}Success{% else %}Failure{% endif %} · view trace {% else %} — {% endif %} |
Baseline stats will appear after you record a few runs.
{% endif %}

| Step | Baseline action | Current action | Result changed? |
|---|---|---|---|
| {{ entry.step }} | {{ entry.action_a or '—' }} | {{ entry.action_b or '—' }} | {{ entry.result_changed }} |
{{ entry.run_a | tojson(indent=2) }}
{{ entry.run_b | tojson(indent=2) }}
No step-level differences detected.
{% endif %}