TraceCore

Metrics Dashboard

Reproducibility rates, budget utilisation, and failure taxonomy across all stored runs.

Unique Tasks
{{ total_tasks }}
Total Runs
{{ total_runs }}
{% if metrics_rows %}
{% set ns = namespace(sum=0, count=0) %}
{% for row in metrics_rows %}
  {% if row.reproducibility_rate is not none %}
    {% set ns.sum = ns.sum + row.reproducibility_rate %}
    {% set ns.count = ns.count + 1 %}
  {% endif %}
{% endfor %}
Avg Repro Rate
{% if ns.count > 0 %}{{ "%.1f"|format(ns.sum / ns.count * 100) }}%{% else %}—{% endif %}
{% endif %}
{% if metrics_rows %}
Per-Task Breakdown
Task Agent Runs Repro Rate Steps P50 Tool Calls P50 Avg Wall (s) Failure Type Termination Reason
{% for row in metrics_rows %}
{{ row.task_ref or "—" }} {{ (row.agent or "—").split("/")[-1] }} {{ row.run_count }} {% set repro = row.reproducibility_rate %} {% if repro is not none %} {% set pct = repro * 100 %}
{{ "%.1f"|format(pct) }}%
{% else %} {% endif %}
{{ row.steps_p50 if row.steps_p50 is not none else "—" }} {{ row.tool_calls_p50 if row.tool_calls_p50 is not none else "—" }} {{ row.avg_wall_clock_s if row.avg_wall_clock_s is not none else "—" }}
{% if row.failure_taxonomy %} {% for ft, cnt in row.failure_taxonomy.items() %} {{ ft }}: {{ cnt }} {% endfor %} {% else %} {% endif %}
{% if row.termination_taxonomy %} {% for tr, cnt in row.termination_taxonomy.items() %} {{ tr }}: {{ cnt }} {% endfor %} {% else %} {% endif %}
{% endfor %}
{% else %}
📊
No run data yet. Run some episodes first:
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
{% endif %}
Machine-readable data is available at GET /api/metrics, which supports the ?task=<ref>, ?agent=<path>, and ?limit=N query parameters. CLI equivalent: agent-bench runs metrics --format json
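
The query parameters of GET /api/metrics can be combined from a script. The following is a minimal Python sketch, not part of TraceCore itself: the base URL `http://localhost:8000`, the helper names `metrics_url` and `fetch_metrics`, and the assumption that the endpoint returns a JSON body are all illustrative and should be adjusted to the actual deployment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the dashboard is served locally on this address.
BASE_URL = "http://localhost:8000"


def metrics_url(base_url=BASE_URL, task=None, agent=None, limit=None):
    """Build the /api/metrics URL, including only the filters that are set."""
    params = {}
    if task is not None:
        params["task"] = task
    if agent is not None:
        params["agent"] = agent
    if limit is not None:
        params["limit"] = limit
    # urlencode percent-escapes characters such as "@" and "/" in the values.
    query = urlencode(params)
    return f"{base_url}/api/metrics" + (f"?{query}" if query else "")


def fetch_metrics(**filters):
    """GET the endpoint and decode the body as JSON (response shape not assumed)."""
    with urlopen(metrics_url(**filters)) as resp:
        return json.load(resp)
```

For example, `fetch_metrics(task="filesystem_hidden_config@1", limit=5)` would request the five-row view for a single task, provided the server is running at the assumed address.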