Execute
Global metrics
Avg success rate
{% if baselines and baselines|length > 0 %}
{{ ((baselines | map(attribute='success_rate') | sum) / baselines|length * 100) | round(1) }}%
{% else %}—{% endif %}
Total runs (recent)
{{ recent_runs|length if recent_runs else 0 }}
Aggregate performance data from persisted runs.
Task database
{% if selected_task_meta %}
{{ selected_task_meta.ref }} · {{ selected_task_meta.suite }}
{{ selected_task_meta.description or 'No description available.' }}
{% else %}
Select a task to see its description.
{% endif %}
Execution logs
{% if recent_runs %}
{% for entry in recent_runs %}
- {{ entry.agent }} → {{ entry.task_ref }}
{% endfor %}
{% else %}
No runs logged yet.
{% endif %}
Output buffer
JSON
{% if error %}
{{ error }}
{% elif result %}
{% set success = result.success %}
{% if success %}Success{% else %}Failure{% endif %}
{% if result_download_id %}
{% endif %}
{{ result | tojson(indent=2) }}
{% else %}
Results will appear here after you launch a run.
{% endif %}
Trace dump
{% if trace_run %}RUN_ID: {{ trace_run.run_id }} · SEED: {{ trace_run.seed }}
{% endif %}
{% if trace_error %}
{{ trace_error }}
{% elif trace_run %}
{% if trace_taxonomy %}
{% endif %}
{% if trace_budget_series %}
{% endif %}
{% if trace_io_summary %}
{% endif %}
Outcome
{{ trace_taxonomy.label }}
Budget burn
Steps
Tool calls
IO audit
Total: {{ trace_io_summary.total }} · FS: {{ trace_io_summary.filesystem }} · NET: {{ trace_io_summary.network }}
{% for entry in trace_io_summary.steps %}
Step {{ entry.step }}{% if entry.action %} · {{ entry.action }}{% endif %}
{% for item in entry.io %}
{{ item.type or 'io' }}{% if item.op %} · {{ item.op }}{% endif %}{% if item.path %} · {{ item.path }}{% endif %}{% if item.host %} · {{ item.host }}{% endif %}
{% endfor %}
{% endfor %}
{% if trace_run.action_trace %}
{% for entry in trace_run.action_trace %}
Step {{ entry.step }} · {{ entry.action.type }}
{{ entry | tojson(indent=2) }}
{% endfor %}
{% else %}
Trace is empty for this run.
{% endif %}
{% else %}
Select a trace from Recent Runs or run a task to view its steps.
{% endif %}
Quick-start pairings
One-click launch for known-good agent+task combinations. Equivalent to agent-bench run pairing <name>.
{% for p in pairings %}
{{ p.name }}
{{ p.agent }}
{{ p.task }}
{{ p.description }}
{% if p.last_run_id is not none %}
{% if p.last_success %}✓ last: success{% else %}✗ last: failed{% endif %}
seed {{ p.last_seed }}
{% else %}
no runs yet
{% endif %}
{% endfor %}
Benchmarks
{% if baselines %}
| Agent | Task | Success % | Avg Steps | Avg Tool Calls | Seed (latest) | Runs | Latest |
|---|---|---|---|---|---|---|---|
{% for row in baselines %}
| {{ row.agent.replace('agents/', '') }} | {{ row.task_ref }} | {{ (row.success_rate * 100) | round(1) }} | {% if row.avg_steps is not none %}{{ row.avg_steps | round(1) }}{% else %}—{% endif %} | {% if row.avg_tool_calls is not none %}{{ row.avg_tool_calls | round(1) }}{% else %}—{% endif %} | {% if row.last_seed is not none %}{{ row.last_seed }}{% else %}—{% endif %} | {{ row.runs }} | {% if row.last_run_id %} {% if row.last_success %}Success{% else %}Failure{% endif %} · view trace {% else %} — {% endif %} |
{% endfor %}
{% else %}
Baseline stats will appear after you record a few runs.
{% endif %}
Compare runs
{% if compare_error %}{{ compare_error }}
{% elif compare_diff %}
Summary
Agent match: {{ compare_diff.summary.same_agent }}
Task match: {{ compare_diff.summary.same_task }}
Success match: {{ compare_diff.summary.same_success }}
Steps: {{ compare_delta.steps_a }} → {{ compare_delta.steps_b }}
(Δ {{ compare_delta.steps_delta }})
Tool calls: {{ compare_delta.tools_a }} → {{ compare_delta.tools_b }}
(Δ {{ compare_delta.tools_delta }})
IO added: {{ compare_diff.summary.io_audit.added }}
IO removed: {{ compare_diff.summary.io_audit.removed }}
{% endif %}
{% if compare_diff.taxonomy %}
Taxonomy shift
Failure type
A: {{ compare_diff.taxonomy.run_a.failure_type or 'none' }}
→
B: {{ compare_diff.taxonomy.run_b.failure_type or 'none' }}
{% if compare_diff.taxonomy.same_failure_type %}
✓ match
{% else %}
✗ changed
{% endif %}
Termination reason
A: {{ compare_diff.taxonomy.run_a.termination_reason or 'none' }}
→
B: {{ compare_diff.taxonomy.run_b.termination_reason or 'none' }}
{% if compare_diff.taxonomy.same_termination_reason %}
✓ match
{% else %}
✗ changed
{% endif %}
Budget delta (B − A)
{% set bd = compare_diff.budget_delta %}
Steps: {% if bd.steps > 0 %}+{% endif %}{{ bd.steps }}
Tool calls: {% if bd.tool_calls > 0 %}+{% endif %}{{ bd.tool_calls }}
Wall: {% if bd.wall_clock_s > 0 %}+{% endif %}{{ bd.wall_clock_s }}s
{% endif %}
{% if compare_step_summary %}
Delta view
| Step | Baseline action | Current action | Result changed? |
|---|---|---|---|
{% for entry in compare_step_summary %}
| {{ entry.step }} | {{ entry.action_a or '—' }} | {{ entry.action_b or '—' }} | {{ entry.result_changed }} |
{% endfor %}
IO Drift
{% for entry in io_steps %}
Step {{ entry.step }} — IO changes
{% set delta = entry.io_audit_delta %}
{% if delta.added %}
Added
{% for item in delta.added %}
{{ item.type or 'io' }} {% if item.op %}· {{ item.op }}{% endif %}{% if item.path %} · {{ item.path }}{% endif %}{% if item.host %} · {{ item.host }}{% endif %}
{% endfor %}
{% endif %}
{% if delta.removed %}
Removed
{% for item in delta.removed %}
{{ item.type or 'io' }} {% if item.op %}· {{ item.op }}{% endif %}{% if item.path %} · {{ item.path }}{% endif %}{% if item.host %} · {{ item.host }}{% endif %}
{% endfor %}
{% endif %}
{% if not delta.added and not delta.removed %}
No IO drift for this step.
{% endif %}
{% endfor %}
{% endif %}
Step differences
{% if compare_diff.step_diffs %}
{% for entry in compare_diff.step_diffs %}
Step {{ entry.step }}
{{ entry.run_a | tojson(indent=2) }}
{{ entry.run_b | tojson(indent=2) }}
{% endfor %}
{% else %}
No step-level differences detected.
{% endif %}
IO Audit
Filesystem & network access recorded per step
Browse run
{% for r in recent_runs %}
{% endfor %}
{% endif %}
Select a recent run above or enter a run ID, then click Load.
IO diff — Run A vs Run B
Click to set Run A → B in order:
{% for r in recent_runs %}
{% endfor %}
Select two runs above and click Diff.
Plugin Registry
Discovered task plugins & lint status
{{ plugin_registry|length }} plugin(s)
{% if plugin_registry %}
{% for p in plugin_registry %}
{{ p.id }}
v{{ p.version }}
{{ p.suite }}
{% if p.source == 'local' %}
local
{% else %}
bundled
{% endif %}
{% if p.lint_ok == true %}
✓ lint
{% elif p.lint_ok == false %}
✗ lint
{% endif %}
{% if p.description %}
{{ p.description | truncate(100, True, '…') }}
{% endif %}
{% if p.actions %}
{% for action in p.actions %}
{{ action }}
{% endfor %}
{% endif %}
{% if p.lint_errors %}
{% for err in p.lint_errors %}
- {{ err }}
{% endfor %}
{% endif %}
{% endfor %}
{% else %}
No task plugins discovered.
{% endif %}