A benchmark migration showing how custom tool design compresses retrieval into one bounded report—and why that matters most for less-SOTA models.
The format choice matters as much as the model itself. LLM coding performance is constrained less by model capability than by the harness: the tool interface that translates model outputs into action.
The original benchmark measured the wrong thing. Synthetic repos with planted keywords and a toy baseline obscured what actually matters: retrieval utility under real ambiguity.
Typed dataclasses internally, one serialized report string externally. The model never sees intermediate scan state. It sees the bounded report.
```
-- Core types --
Relevance (enum)
FileEntry (path, relevance, role, key_symbols, imports_from, line_count, excerpt)
ConceptCluster (name, description, files)
DiscoveryReport (query, summary, clusters, file_tree, ...)

-- Pipeline --
extract_terms -> generate_patterns -> collect_candidates -> evaluate_prospects
    -> cluster -> build_tree -> DiscoveryReport -> to_context()

-- Tool boundary --
async def discover(...):
    report = await asyncio.to_thread(_discover_sync, ...)
    return report.to_context()
```
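A minimal sketch of those types and the boundary, assuming only the fields listed above; the real definitions and the _discover_sync pipeline live in the tool module, so everything beyond the outline is illustrative:

```python
import asyncio
from dataclasses import dataclass
from enum import Enum

class Relevance(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class FileEntry:
    path: str
    relevance: Relevance
    role: str                 # e.g. "entry point", "config", "test"
    key_symbols: list[str]
    imports_from: list[str]
    line_count: int
    excerpt: str

@dataclass
class ConceptCluster:
    name: str
    description: str
    files: list[FileEntry]

@dataclass
class DiscoveryReport:
    query: str
    summary: str
    clusters: list[ConceptCluster]
    file_tree: str

    def to_context(self) -> str:
        # Serialize everything into the one bounded string the model sees.
        parts = [f"Query: {self.query}", f"Summary: {self.summary}"]
        for cluster in self.clusters:
            names = ", ".join(entry.path for entry in cluster.files)
            parts.append(f"[{cluster.name}] {cluster.description}: {names}")
        parts.append(self.file_tree)
        return "\n".join(parts)

def _discover_sync(query: str) -> DiscoveryReport:
    # Placeholder for the internal pipeline:
    # extract_terms -> generate_patterns -> collect_candidates
    # -> evaluate_prospects -> cluster -> build_tree
    raise NotImplementedError

async def discover(query: str) -> str:
    # The scan runs off the event loop; only the serialized report
    # crosses the tool boundary.
    report = await asyncio.to_thread(_discover_sync, query)
    return report.to_context()
```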
The rewritten benchmark in tests/benchmarks/bench_discover.py compares discover (one tool call via the tinyagent adapter) against the legacy chain (list_dir -> glob -> grep -> read_file via separate tool calls).
| Dimension | Measurement | Why it matters |
|---|---|---|
| Latency | per-query cold/warm + p50/p95 | Real responsiveness under setup overhead |
| Tool calls | calls per successful retrieval | Round-trip and orchestration burden |
| Output footprint | ceil(chars / 4) estimated tokens | Context cost and prompt pressure |
| File recall | hits on expected file set | Did we find the right files? |
| Symbol recall | hits on expected symbol set | Did we return useful handles for edits? |
| Actionability | top-rank hit or sufficient recall | Can the agent choose the next action immediately? |
Query set: 12 real TunaCode topics (compaction, cache manager, LSP diagnostics, model picker, tool registration, etc.). No synthetic fixtures.
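A sketch of how those dimensions reduce to code. The fixture shape and the 0.5 actionability floor below are assumptions; bench_discover.py may define them differently:

```python
import math
from dataclasses import dataclass

@dataclass
class QueryCase:
    # Assumed fixture shape: one real topic plus its expected hits.
    query: str
    expected_files: set[str]
    expected_symbols: set[str]

def estimated_tokens(output: str) -> int:
    # Output footprint metric from the table: ceil(chars / 4).
    return math.ceil(len(output) / 4)

def recall(found: set[str], expected: set[str]) -> float:
    # Shared by file recall and symbol recall.
    return len(found & expected) / len(expected) if expected else 1.0

def actionable(ranked_files: list[str], case: QueryCase) -> bool:
    # "Top-rank hit or sufficient recall"; the 0.5 floor is a guess.
    top_hit = bool(ranked_files) and ranked_files[0] in case.expected_files
    return top_hit or recall(set(ranked_files), case.expected_files) >= 0.5
```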
The classic glob -> grep -> read flow is flexible, but it pushes planning,
branching, and synthesis work onto the model. That is exactly where less-SOTA models
become inconsistent. Discover shifts that work into the tool, as the table and the sketch below illustrate.
| Step | Legacy chain | Discover flow |
|---|---|---|
| Query interpretation | Model invents patterns, retries search terms, decides file scopes | Tool extracts terms and expands concepts with a fixed internal pipeline |
| Retrieval | Multiple calls and manual narrowing across raw outputs | Single call with bounded candidate collection and scoring |
| Context handoff | Raw text from different tools, mixed granularity | One typed report: clusters, symbols, roles, excerpts |
| Weaker model behavior | Higher variance: more branch points and interpretation burden | More deterministic: fewer branches, clearer next target |
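To make that contrast concrete, here is roughly what each side looks like from the agent's seat. The async stubs are hypothetical stand-ins for the real tools; only the call shapes and round-trip counts matter:

```python
import asyncio

# Hypothetical stubs standing in for the real tools.
async def list_dir(path): return ["src/context/compaction.py"]
async def glob(pattern): return ["src/context/compaction.py"]
async def grep(term, path): return ["src/context/compaction.py"]
async def read_file(path): return "def compact(messages): ..."
async def discover(query): return "Summary: ...\n[compaction] ..."

async def legacy_chain(topic: str) -> str:
    # The model plans each hop, inspects raw output, and decides
    # the next call itself: four or more round trips per retrieval.
    await list_dir("src/")                  # call 1: orient
    await glob(f"src/**/*{topic}*")         # call 2: narrow
    hits = await grep(topic, "src/")        # call 3: locate
    return await read_file(hits[0])         # call 4+: read, often retried

async def discover_flow(topic: str) -> str:
    # One round trip; the tool returns a single bounded report.
    return await discover(f"how does {topic} work?")

print(asyncio.run(discover_flow("compaction")))
```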
Internally, discover runs this loop:

```mermaid
flowchart TD
    A[Natural Language Query] --> B[Generate initial glob patterns + related terms]
    B --> C[Glob aggressively 3x to build prospect list]
    C --> D[Add all prospects to Queue]
    D --> E
    subgraph Iterative Processing
        E{Queue empty?} -->|No| F[Take next prospect]
        F --> G[Read first 200 lines]
        G --> H{Relevant based on current context?}
        H -->|Yes| I[Extract structured info + Update ontological map]
        I --> J[Discover new related files + Add them to Queue]
        J --> E
        H -->|No| K{Maybe later?}
        K -->|Yes, requeue| L[Push back to Queue with lower priority]
        L --> E
        K -->|No, discard| M[Discard prospect]
        M --> E
    end
    E -->|Yes| N[Consolidate ontological map]
    N --> O[Generate concise typed report for main agent]
    O --> P[Render Mermaid chart of ontology]
```
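A condensed sketch of that loop. The queue discipline and the three-way relevance verdict follow the chart; is_relevant, extract, and find_related are placeholders for the tool's internal scoring, structured extraction, and link discovery:

```python
from collections import deque

MAX_HEAD_LINES = 200  # "Read first 200 lines" step from the chart

def process_prospects(prospects, is_relevant, extract, find_related):
    queue = deque(prospects)
    requeued: set[str] = set()
    ontology: dict[str, dict] = {}

    while queue:
        path = queue.popleft()
        with open(path, errors="replace") as f:
            head = "".join(f.readline() for _ in range(MAX_HEAD_LINES))

        verdict = is_relevant(path, head, ontology)  # "yes" / "later" / "no"
        if verdict == "yes":
            ontology[path] = extract(path, head)     # update ontological map
            for related in find_related(path, head):
                if related not in ontology and related not in queue:
                    queue.append(related)            # new prospects join the queue
        elif verdict == "later" and path not in requeued:
            requeued.add(path)                       # at most one deferral each
            queue.append(path)                       # back of queue ~ lower priority
        # "no": prospect discarded

    return ontology  # consolidated into the typed report afterwards
```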
This is the core steering advantage: custom tools remove planning entropy. Less-smart models no longer need to invent a search strategy mid-flight; they execute a deterministic loop and act on a single high-signal report.
Saved baseline run from tests/benchmarks/bench_discover.py after the benchmark rewrite. Each cell reads discover vs legacy chain.
| Mode | Latency (ms) | Tool calls | Est. tokens | File recall | Symbol recall | Actionable |
|---|---|---|---|---|---|---|
| Cold | 2587 vs 2811 | 1.0 vs 6.0 | 3063 vs 12858 | 0.67 vs 0.46 | 0.50 vs 0.29 | 0.67 vs 0.50 |
| Warm | 2597 vs 2832 | 1.0 vs 6.0 | 3063 vs 12984 | 0.67 vs 0.40 | 0.50 vs 0.31 | 0.67 vs 0.42 |
Frontier models can brute-force noisy tool logs better than mid-tier models. Less-SOTA models are more sensitive to branchy search plans, noisy outputs, and ranking ambiguity. Custom tools narrow that gap by packaging retrieval into a higher-level primitive.
| Failure mode (weaker model) | Custom tool mitigation |
|---|---|
| Loses thread across 3–6 tool calls | Single-call report with summary + clusters + symbols |
| Wastes tokens on irrelevant raw output | Bounded report size and relevance filtering |
| Picks wrong file from weak ranking signals | Role labeling, symbol extraction, relevance tiers |
| Higher variance in planning quality | Stable output shape per query |
The practical strategy is not “make the model smarter” first. It is “make the interface to code search less lossy” first.