Run a Task

Runs are deterministic, and their artifacts are written to .agent_bench/runs/. Replays accept overrides.
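To inspect the artifacts directory mentioned above from Python, a minimal sketch follows. It assumes each run is stored as a single JSON file directly under .agent_bench/runs/. That layout and the `list_run_artifacts` helper are hypothetical and not confirmed by this page.

```python
from pathlib import Path


def list_run_artifacts(root=".agent_bench/runs"):
    """Return run artifact paths, newest first.

    Assumption: one *.json file per run, stored directly under `root`.
    """
    runs_dir = Path(root)
    if not runs_dir.is_dir():
        # Nothing has been recorded yet (or the directory lives elsewhere).
        return []
    return sorted(
        runs_dir.glob("*.json"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )
```

Calling `list_run_artifacts()` from the project root would return an empty list until at least one run has been recorded.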

Task Preview

{% if selected_task_meta %}

{{ selected_task_meta.ref }}
Suite: {{ selected_task_meta.suite }}

{{ selected_task_meta.description or 'No description available.' }}

{% else %}

Select a task to see its description.

{% endif %}

Recent Runs

{% if recent_runs %}
    {% for entry in recent_runs %}
  • {{ entry.agent }} → {{ entry.task_ref }}
    Seed {{ entry.seed }} · {% if entry.failure_type %}{{ entry.failure_type }}{% else %}success{% endif %} · trace
    {% endfor %}
{% else %}

No runs logged yet.

{% endif %}

Result JSON

{% if error %}
{{ error }}
{% elif result %}
{% set success = result.success %}
{% if success %}Success{% else %}Failure{% endif %}
{% if result_download_id %}
download trace JSON
{% endif %}
{{ result | tojson(indent=2) }}
{% else %}

Results will appear here after you launch a run.

{% endif %}

Trace Viewer

{% if trace_error %}
{{ trace_error }}
{% elif trace_run %}
Run ID: {{ trace_run.run_id }}
Agent: {{ trace_run.agent }}
Task: {{ trace_run.task_ref }}
Seed: {{ trace_run.seed }}
Download JSON
{% if trace_taxonomy %}

Outcome

{{ trace_taxonomy.label }}
{% endif %} {% if trace_budget_series %}

Budget burn

Steps
Tool calls
{% endif %}
{% if trace_run.action_trace %} {% for entry in trace_run.action_trace %}
Step {{ entry.step }} · {{ entry.action.type }}
{{ entry | tojson(indent=2) }}
{% endfor %} {% else %}

Trace is empty for this run.

{% endif %}
{% else %}

Select a trace from Recent Runs or run a task to view its steps.

{% endif %}

Quick-Start Pairings

One-click launch for known-good agent and task combinations. Each button is equivalent to running agent-bench run pairing <name>.

{% for p in pairings %}
{{ p.name }}
{{ p.agent }}
{{ p.task }}
{{ p.description }}
{% if p.last_run_id is not none %} {% else %}
no runs yet
{% endif %}
{% endfor %}
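The CLI form quoted above for this section can also be scripted. The sketch below builds that command from Python. It assumes the `agent-bench` executable is on PATH; `pairing_command` and `launch_pairing` are hypothetical helper names, and no extra flags are assumed.

```python
import shlex
import subprocess


def pairing_command(name):
    """Build the argv for: agent-bench run pairing <name>."""
    return ["agent-bench", "run", "pairing", name]


def launch_pairing(name):
    """Run the pairing and capture its output.

    Assumption: `agent-bench` is installed and on PATH.
    A non-zero exit code is returned to the caller, not raised.
    """
    return subprocess.run(
        pairing_command(name),
        capture_output=True,
        text=True,
        check=False,
    )


def describe_pairing(name):
    """Return the shell-quoted command line, useful for logging."""
    return shlex.join(pairing_command(name))
```

Passing the command as a list avoids shell quoting problems if a pairing name ever contains spaces.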

Baselines

{% if published_baseline %}
Latest published
{{ published_baseline.generated_at }}
Stored at:
{{ published_baseline._path }}
{% if published_baseline.metadata %}
Filters:
agent={{ published_baseline.metadata.agent_filter or 'all' }}, task={{ published_baseline.metadata.task_filter or 'all' }}
{% endif %}
download JSON
{% endif %} {% if baselines %}
{{ baselines | length }} agent/task combos tracked.
Derived from persisted runs.
Agent | Task | Success % | Avg Steps | Avg Tool Calls | Seed (latest) | Runs | Latest
{% for row in baselines %}
{{ row.agent.replace('agents/', '') }} | {{ row.task_ref }} | {{ (row.success_rate * 100) | round(1) }} | {% if row.avg_steps is not none %}{{ row.avg_steps | round(1) }}{% else %}—{% endif %} | {% if row.avg_tool_calls is not none %}{{ row.avg_tool_calls | round(1) }}{% else %}—{% endif %} | {% if row.last_seed is not none %}{{ row.last_seed }}{% else %}—{% endif %} | {{ row.runs }} | {% if row.last_run_id %}{% if row.last_success %}Success{% else %}Failure{% endif %} · view trace{% else %}—{% endif %}
{% endfor %}
{% else %}

Baseline stats will appear after you record a few runs.

{% endif %}
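The baseline rows in this section are derived from persisted runs. One way such rows could be aggregated is sketched below. The input record fields (`agent`, `task_ref`, `success`, `steps`, `tool_calls`, `seed`, `run_id`) and the chronological ordering of `runs` are assumptions; the output keys mirror the names this template reads from each `row`.

```python
from collections import defaultdict


def _mean(values):
    """Average of a list, or None when there is nothing to average."""
    return sum(values) / len(values) if values else None


def summarize_baselines(runs):
    """Group run records by (agent, task_ref) and compute per-group stats.

    Assumption: `runs` is ordered oldest to newest, so the last record
    in each group is the latest run.
    """
    groups = defaultdict(list)
    for run in runs:
        groups[(run["agent"], run["task_ref"])].append(run)

    rows = []
    for (agent, task_ref), items in sorted(groups.items()):
        latest = items[-1]
        steps = [r["steps"] for r in items if r.get("steps") is not None]
        tools = [r["tool_calls"] for r in items if r.get("tool_calls") is not None]
        rows.append(
            dict(
                agent=agent,
                task_ref=task_ref,
                success_rate=sum(1 for r in items if r["success"]) / len(items),
                avg_steps=_mean(steps),
                avg_tool_calls=_mean(tools),
                last_seed=latest.get("seed"),
                runs=len(items),
                last_run_id=latest.get("run_id"),
                last_success=latest["success"],
            )
        )
    return rows
```

Returning `None` for missing averages matches the template's `is not none` checks, which render a placeholder in that case.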

Compare runs

{% set compare_run_a = compare_inputs.run_a if compare_inputs else '' %}
{% set compare_run_b = compare_inputs.run_b if compare_inputs else '' %}
{% if compare_error %}
{{ compare_error }}
{% elif compare_diff %}

Summary

Agent match: {{ compare_diff.summary.same_agent }}
Task match: {{ compare_diff.summary.same_task }}
Success match: {{ compare_diff.summary.same_success }}
Steps: {{ compare_delta.steps_a }} → {{ compare_delta.steps_b }} (Δ {{ compare_delta.steps_delta }})
Tool calls: {{ compare_delta.tools_a }} → {{ compare_delta.tools_b }} (Δ {{ compare_delta.tools_delta }})
{% if compare_step_summary %}

Delta view

Step | Baseline action | Current action | Result changed?
{% for entry in compare_step_summary %}
{{ entry.step }} | {{ entry.action_a or '—' }} | {{ entry.action_b or '—' }} | {{ entry.result_changed }}
{% endfor %}
{% endif %}

Step differences

{% if compare_diff.step_diffs %}
{% for entry in compare_diff.step_diffs %}
Step {{ entry.step }}
{{ entry.run_a | tojson(indent=2) }}
{{ entry.run_b | tojson(indent=2) }}
{% endfor %}
{% else %}

No step-level differences detected.

{% endif %}
{% endif %}
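The step counts and the delta view in this section could be computed along the following lines. This is a sketch under assumptions: each trace entry is taken to be a mapping with an `action` mapping that has a `type`, as the Trace Viewer suggests, while the `result` key and both function names are hypothetical.

```python
from itertools import zip_longest


def steps_delta(trace_a, trace_b):
    """Step counts for both runs and their difference (b minus a)."""
    return dict(
        steps_a=len(trace_a),
        steps_b=len(trace_b),
        steps_delta=len(trace_b) - len(trace_a),
    )


def step_summary(trace_a, trace_b):
    """Align two action traces by position and flag changed results.

    A step present in only one run counts as changed.
    Assumption: each entry stores its outcome under a "result" key.
    """
    rows = []
    pairs = zip_longest(trace_a, trace_b)
    for index, (a, b) in enumerate(pairs, start=1):
        missing = a is None or b is None
        rows.append(
            dict(
                step=index,
                action_a=a["action"]["type"] if a else None,
                action_b=b["action"]["type"] if b else None,
                result_changed=missing or a.get("result") != b.get("result"),
            )
        )
    return rows
```

Aligning by position keeps the sketch simple; a real comparison might instead align on each entry's own step number.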