Reproducibility rates, budget utilisation, and failure taxonomy across all stored runs.
| Task | Agent | Runs | Repro Rate (%) | Steps P50 | Tool Calls P50 | Avg Wall (s) | Failure Type | Termination Reason |
|---|---|---|---|---|---|---|---|---|
{{ row.task_ref or "—" }} |
{{ row.run_count }} |
{% set repro = row.reproducibility_rate %}
{% if repro is not none %}
{% set pct = repro * 100 %}
{{ "%.1f"|format(pct) }}%
{% else %}
—
{% endif %}
|
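{{ row.steps_p50 if row.steps_p50 is not none else "—" }} | {{ row.tool_calls_p50 if row.tool_calls_p50 is not none else "—" }} | {{ row.avg_wall_clock_s if row.avg_wall_clock_s is not none else "—" }} |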
{% if row.failure_taxonomy %}
{% for ft, cnt in row.failure_taxonomy.items() %}
{{ ft }}: {{ cnt }}
{% endfor %}
{% else %}
—
{% endif %}
|
{% if row.termination_taxonomy %}
{% for tr, cnt in row.termination_taxonomy.items() %}
{{ tr }}: {{ cnt }}
{% endfor %}
{% else %}
—
{% endif %}
|
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
API: GET /api/metrics accepts the query params ?task=<ref>, ?agent=<path>, and ?limit=N.
CLI: agent-bench runs metrics --format json
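
The query parameters listed above can be combined in a single request. The following is a minimal sketch of building such a URL with the Python standard library. The base URL `http://localhost:8000`, the helper names `metrics_url` and `fetch_metrics`, and the assumption that the endpoint returns JSON are all hypothetical and not taken from the source; adjust them to your deployment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def metrics_url(base_url, task=None, agent=None, limit=None):
    """Build a GET /api/metrics URL from the optional filters.

    base_url is an assumption (e.g. a local dev server); only the
    path and the task/agent/limit parameter names come from the page.
    """
    params = {}
    if task is not None:
        params["task"] = task
    if agent is not None:
        params["agent"] = agent
    if limit is not None:
        params["limit"] = int(limit)
    url = base_url.rstrip("/") + "/api/metrics"
    query = urlencode(params)  # percent-encodes '@' and '/' in the values
    return f"{url}?{query}" if query else url


def fetch_metrics(base_url, **filters):
    """Fetch and decode the response, assuming the body is JSON."""
    with urlopen(metrics_url(base_url, **filters)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical host and port; prints the URL without sending a request.
    print(metrics_url(
        "http://localhost:8000",
        task="filesystem_hidden_config@1",
        agent="agents/toy_agent.py",
        limit=5,
    ))
```

Percent-encoding matters here because task refs contain `@` and agent paths contain `/`; `urlencode` handles both, so the values can be passed exactly as they appear on the command line.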