Dashboard
Server: local
Port: 8000
Status: Online
Execute

Deterministic runs; artifacts land in .agent_bench/runs/. Replay accepts overrides.

Global metrics
Avg success rate
{% if baselines and baselines|length > 0 %} {{ ((baselines | map(attribute='success_rate') | sum) / baselines|length * 100) | round(1) }}% {% else %}—{% endif %}
Total runs (recent)
{{ recent_runs|length if recent_runs else 0 }}

Aggregate performance data from persisted runs.
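{#
The "Avg success rate" expression above averages success_rate across the baseline rows, scales it to a percentage, and rounds to one decimal. A minimal Python sketch of the same arithmetic, assuming each row is a dict with a success_rate between 0 and 1 (the function name is hypothetical and not part of the dashboard code):

```python
def average_success_rate(baselines):
    """Return the mean success rate as a percentage string, or None when there are no rows."""
    if not baselines:
        # The template shows a placeholder instead of a number in this case.
        return None
    mean = sum(row["success_rate"] for row in baselines) / len(baselines)
    # Same order as the template: scale to percent first, then round to one decimal.
    return f"{round(mean * 100, 1)}%"


if __name__ == "__main__":
    print(average_success_rate([{"success_rate": 0.5}, {"success_rate": 1.0}]))
```

This is an unweighted mean over agent/task pairs, so a pair with one run counts as much as a pair with fifty.
#}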

Task database
{% if selected_task_meta %}
{{ selected_task_meta.ref }} · {{ selected_task_meta.suite }}

{{ selected_task_meta.description or 'No description available.' }}

{% else %}

Select a task to see its description.

{% endif %}
Execution logs
{% if trace_id %} {% endif %}
Reset
{% if recent_runs %}
    {% for entry in recent_runs %}
  • {{ entry.agent }} → {{ entry.task_ref }}
    Seed {{ entry.seed }} · {% if entry.failure_type %}{{ entry.failure_type }}{% else %}success{% endif %} · trace
    {% endfor %}
{% else %}

No runs logged yet.

{% endif %}
Output buffer
JSON
{% if error %}
{{ error }}
{% elif result %} {% set success = result.success %} {% if success %}Success{% else %}Failure{% endif %} {% if result_download_id %} {% endif %}
{{ result | tojson(indent=2) }}
{% else %}

Results will appear here after you launch a run.

{% endif %}
Trace dump
{% if trace_run %}
RUN_ID: {{ trace_run.run_id }} · SEED: {{ trace_run.seed }}
{% endif %}
{% if trace_error %}
{{ trace_error }}
{% elif trace_run %}
Run ID: {{ trace_run.run_id }}
Agent: {{ trace_run.agent }}
Task: {{ trace_run.task_ref }}
Seed: {{ trace_run.seed }}
{% if trace_taxonomy %}

Outcome

{{ trace_taxonomy.label }}
{% endif %} {% if trace_budget_series %}

Budget burn

Steps
Tool calls
{% endif %} {% if trace_io_summary %}

IO audit

Total: {{ trace_io_summary.total }} · FS: {{ trace_io_summary.filesystem }} · NET: {{ trace_io_summary.network }}
{% for entry in trace_io_summary.steps %}
Step {{ entry.step }}{% if entry.action %} · {{ entry.action }}{% endif %}
{% for item in entry.io %} {{ item.type or 'io' }}{% if item.op %} · {{ item.op }}{% endif %}{% if item.path %} · {{ item.path }}{% endif %}{% if item.host %} · {{ item.host }}{% endif %} {% endfor %}
{% endfor %}
{% endif %}
{% if trace_run.action_trace %} {% for entry in trace_run.action_trace %}
Step {{ entry.step }} · {{ entry.action.type }}
{{ entry | tojson(indent=2) }}
{% endfor %} {% else %}

Trace is empty for this run.

{% endif %}
{% else %}

Select a trace from Recent Runs or run a task to view its steps.

{% endif %}
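{#
The IO audit panel above reads total, filesystem, network, and a per-step list from trace_io_summary. One way such a summary could be derived from an action trace is sketched below. The io_audit key on each trace entry and the "fs" / "net" type values are assumptions made for illustration; the real record shape may differ.

```python
def summarize_io(action_trace):
    """Count IO records per category and keep the per-step detail."""
    summary = {"total": 0, "filesystem": 0, "network": 0, "steps": []}
    for entry in action_trace:
        # "io_audit" is a hypothetical key holding the IO records for one step.
        records = entry.get("io_audit") or []
        if not records:
            continue  # steps without IO are left out of the audit list
        for record in records:
            summary["total"] += 1
            if record.get("type") == "fs":  # assumed label for filesystem access
                summary["filesystem"] += 1
            elif record.get("type") == "net":  # assumed label for network access
                summary["network"] += 1
        summary["steps"].append({
            "step": entry["step"],
            "action": (entry.get("action") or {}).get("type"),
            "io": records,
        })
    return summary
```

Records with an unrecognized type still count toward the total, which matches the template's fallback label of 'io' for untyped items.
#}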
Quick-start pairings

One-click launch for known-good agent+task combinations. Equivalent to agent-bench run pairing <name>.

{% for p in pairings %}
{{ p.name }}
{{ p.agent }}
{{ p.task }}
{{ p.description }}
{% if p.last_run_id is not none %} {% else %}
no runs yet
{% endif %}
{% endfor %}
Benchmarks
{% if trace_id %} {% endif %}
Reset
{% if published_baseline %}
Latest published
{{ published_baseline.generated_at }}
Stored at:
{{ published_baseline._path }}
{% if published_baseline.metadata %}
Filters:
agent={{ published_baseline.metadata.agent_filter or 'all' }}, task={{ published_baseline.metadata.task_filter or 'all' }}
{% endif %}
{% endif %} {% if baselines %}
{{ baselines | length }} agent/task combos tracked.
Derived from persisted runs.
Agent | Task | Success % | Avg Steps | Avg Tool Calls | Seed (latest) | Runs | Latest
{% for row in baselines %}
{{ row.agent.replace('agents/', '') }} | {{ row.task_ref }} | {{ (row.success_rate * 100) | round(1) }} | {% if row.avg_steps is not none %}{{ row.avg_steps | round(1) }}{% else %}—{% endif %} | {% if row.avg_tool_calls is not none %}{{ row.avg_tool_calls | round(1) }}{% else %}—{% endif %} | {% if row.last_seed is not none %}{{ row.last_seed }}{% else %}—{% endif %} | {{ row.runs }} | {% if row.last_run_id %}{% if row.last_success %}Success{% else %}Failure{% endif %} · view trace{% else %}—{% endif %}
{% endfor %}
{% else %}

Baseline stats will appear after you record a few runs.

{% endif %}
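{#
The baseline table above is described as derived from persisted runs. A rough sketch of that aggregation follows, producing the row fields the template reads (success_rate, avg_steps, avg_tool_calls, last_seed, runs, last_run_id, last_success). The shape of a run record (success, steps, tool_calls keys) and the function name are assumptions, not the project's actual loader.

```python
from collections import defaultdict


def build_baselines(runs):
    """Group run records by (agent, task_ref) and compute per-pair aggregates."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run["agent"], run["task_ref"])].append(run)

    rows = []
    for (agent, task_ref), group in sorted(groups.items()):
        steps = [r["steps"] for r in group if r.get("steps") is not None]
        tools = [r["tool_calls"] for r in group if r.get("tool_calls") is not None]
        latest = group[-1]  # assumes runs are supplied oldest-first
        rows.append({
            "agent": agent,
            "task_ref": task_ref,
            "success_rate": sum(1 for r in group if r["success"]) / len(group),
            # None lets the template fall back to its placeholder cell.
            "avg_steps": sum(steps) / len(steps) if steps else None,
            "avg_tool_calls": sum(tools) / len(tools) if tools else None,
            "last_seed": latest.get("seed"),
            "runs": len(group),
            "last_run_id": latest.get("run_id"),
            "last_success": latest["success"],
        })
    return rows
```

Averages skip runs where a counter is missing rather than treating it as zero, so a partially recorded run does not drag the mean down.
#}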

Compare runs

{% set compare_run_a = compare_inputs.run_a if compare_inputs else '' %} {% set compare_run_b = compare_inputs.run_b if compare_inputs else '' %}
{% if compare_error %}
{{ compare_error }}
{% elif compare_diff %}

Summary

Agent match: {{ compare_diff.summary.same_agent }}
Task match: {{ compare_diff.summary.same_task }}
Success match: {{ compare_diff.summary.same_success }}
Steps: {{ compare_delta.steps_a }} → {{ compare_delta.steps_b }} (Δ {{ compare_delta.steps_delta }})
Tool calls: {{ compare_delta.tools_a }} → {{ compare_delta.tools_b }} (Δ {{ compare_delta.tools_delta }})
{% if compare_diff.summary.io_audit and (compare_diff.summary.io_audit.added or compare_diff.summary.io_audit.removed) %}
IO added: {{ compare_diff.summary.io_audit.added }} · IO removed: {{ compare_diff.summary.io_audit.removed }}
{% endif %} {% if compare_diff.taxonomy %}

Taxonomy shift

Failure type
A: {{ compare_diff.taxonomy.run_a.failure_type or 'none' }} B: {{ compare_diff.taxonomy.run_b.failure_type or 'none' }} {% if compare_diff.taxonomy.same_failure_type %} ✓ match {% else %} ✗ changed {% endif %}
Termination reason
A: {{ compare_diff.taxonomy.run_a.termination_reason or 'none' }} B: {{ compare_diff.taxonomy.run_b.termination_reason or 'none' }} {% if compare_diff.taxonomy.same_termination_reason %} ✓ match {% else %} ✗ changed {% endif %}
{% endif %} {% if compare_diff.budget_delta %}

Budget delta (B − A)

{% set bd = compare_diff.budget_delta %}
Steps: {% if bd.steps > 0 %}+{% endif %}{{ bd.steps }} · Tool calls: {% if bd.tool_calls > 0 %}+{% endif %}{{ bd.tool_calls }} · Wall: {% if bd.wall_clock_s > 0 %}+{% endif %}{{ bd.wall_clock_s }}s
{% endif %} {% if compare_step_summary %}

Delta view

Step | Baseline action | Current action | Result changed?
{% for entry in compare_step_summary %}
{{ entry.step }} | {{ entry.action_a or '—' }} | {{ entry.action_b or '—' }} | {{ entry.result_changed }}
{% endfor %}
{% endif %} {% set io_steps = compare_diff.step_diffs | selectattr('io_audit_delta', 'defined') | list %} {% if io_steps %}

IO Drift

{% for entry in io_steps %}
Step {{ entry.step }} — IO changes {% set delta = entry.io_audit_delta %} {% if delta.added %}
Added
{% for item in delta.added %} {{ item.type or 'io' }} {% if item.op %}· {{ item.op }}{% endif %}{% if item.path %} · {{ item.path }}{% endif %}{% if item.host %} · {{ item.host }}{% endif %} {% endfor %}
{% endif %} {% if delta.removed %}
Removed
{% for item in delta.removed %} {{ item.type or 'io' }} {% if item.op %}· {{ item.op }}{% endif %}{% if item.path %} · {{ item.path }}{% endif %}{% if item.host %} · {{ item.host }}{% endif %} {% endfor %}
{% endif %} {% if not delta.added and not delta.removed %}

No IO drift for this step.

{% endif %}
{% endfor %}
{% endif %}

Step differences

{% if compare_diff.step_diffs %}
{% for entry in compare_diff.step_diffs %}
Step {{ entry.step }}
{{ entry.run_a | tojson(indent=2) }}
{{ entry.run_b | tojson(indent=2) }}
{% endfor %}
{% else %}

No step-level differences detected.

{% endif %}
{% endif %}
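{#
The "Budget delta (B − A)" block above prints each counter with an explicit plus sign when run B used more than run A. The same computation and formatting can be sketched in Python. The per-run dict keys mirror the template's steps, tool_calls, and wall_clock_s fields, while both function names are hypothetical.

```python
def budget_delta(run_a, run_b):
    """Compute B minus A for each budget counter."""
    keys = ("steps", "tool_calls", "wall_clock_s")
    return {key: run_b[key] - run_a[key] for key in keys}


def signed(value):
    """Prefix positive numbers with '+'; zero and negative values print unchanged."""
    return f"+{value}" if value > 0 else str(value)


if __name__ == "__main__":
    delta = budget_delta(
        {"steps": 5, "tool_calls": 3, "wall_clock_s": 2},
        {"steps": 7, "tool_calls": 3, "wall_clock_s": 1},
    )
    print(
        f"Steps: {signed(delta['steps'])}, "
        f"Tool calls: {signed(delta['tool_calls'])}, "
        f"Wall: {signed(delta['wall_clock_s'])}s"
    )
```

A positive delta means run B consumed more of that budget than run A, so a regression shows up as a leading plus sign.
#}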
IO Audit
Filesystem & network access recorded per step
{% if recent_runs %} {% for r in recent_runs %} {% endfor %} {% endif %}

Browse run

{% if recent_runs %}
{% for r in recent_runs %} {% endfor %}
{% endif %}
Select a recent run above or enter a run ID, then click Load.

IO diff — Run A vs Run B

{% if recent_runs %}
Click two runs in order to set Run A, then Run B:
{% for r in recent_runs %} {% endfor %}
{% endif %}
Select two runs above and click Diff.
Plugin Registry
Discovered task plugins & lint status
{% if plugin_registry %}
{{ plugin_registry|length }} plugin(s)
{% for p in plugin_registry %}
{{ p.id }} v{{ p.version }}
{{ p.suite }} {% if p.source == 'local' %} local {% else %} bundled {% endif %} {% if p.lint_ok == true %} ✓ lint {% elif p.lint_ok == false %} ✗ lint {% endif %}
{% if p.description %}

{{ p.description | truncate(100, True, '…') }}

{% endif %} {% if p.actions %}
{% for action in p.actions %} {{ action }} {% endfor %}
{% endif %} {% if p.lint_errors %}
    {% for err in p.lint_errors %}
  • {{ err }}
    {% endfor %}
{% endif %}
{% endfor %}
{% else %}

No task plugins discovered.

{% endif %}