Case Study

Replacing Glob, Grep, and List-Dir with a Single Discover Call

A benchmark migration showing how custom tool design compresses retrieval into one bounded report—and why that matters most for less‑SOTA models.

Repository: tunacode
Audience: Agent tool designers
12 real queries, no synthetic fixtures
Motivating Insight
The format choice matters as much as the model itself. LLM coding performance is artificially constrained not by model capability but by the harness—the tool interface that translates model outputs into action.
Can Bölük, “The Harness Problem” (2026)
Round Trips: 1 vs 6 (discover vs legacy chain)
Output Tokens: ~4.2x less (~3.1k vs ~12.9k)
Latency: ~9% faster (cold + warm averages)
Actionability: 0.67 vs 0.42 (warm-mode score)
01

Why the First Benchmark Failed

The original benchmark measured the wrong thing. Synthetic repos with planted keywords and a toy baseline obscured what actually matters: retrieval utility under real ambiguity.

What went wrong

  • Synthetic repositories had planted keywords and low ambiguity
  • The baseline was a toy loop, not the old production tool chain with its real overhead
  • Raw speed comparisons matched unlike work: discover also does scoring, symbol extraction, and clustering

Correction applied

  • Run against TunaCode itself—a real, evolving codebase
  • Model old behavior as separate tool calls with adapter/decorator overhead
  • Measure retrieval utility and actionability, not file-walk speed alone
02

How Discover Is Implemented

Typed dataclasses internally, one serialized report string externally. The model never sees intermediate scan state. It sees the bounded report.

Core Types + Pipeline (src/tunacode/tools/discover.py)
-- Core types --
Relevance       (enum)
FileEntry       (path, relevance, role, key_symbols, imports_from, line_count, excerpt)
ConceptCluster  (name, description, files)
DiscoveryReport (query, summary, clusters, file_tree, ...)

-- Pipeline --
extract_terms -> generate_patterns -> collect_candidates -> evaluate_prospects
  -> cluster -> build_tree -> DiscoveryReport -> to_context()

-- Tool boundary --
async def discover(...):
    report = await asyncio.to_thread(_discover_sync, ...)
    return report.to_context()
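For orientation, here is a minimal Python sketch of what those typed structures and the to_context() serializer could look like. Field defaults and the rendered layout are illustrative assumptions, not the exact definitions in src/tunacode/tools/discover.py.

from dataclasses import dataclass, field
from enum import Enum


class Relevance(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class FileEntry:
    path: str
    relevance: Relevance
    role: str                                   # e.g. "entry point", "config", "test"
    key_symbols: list[str] = field(default_factory=list)
    imports_from: list[str] = field(default_factory=list)
    line_count: int = 0
    excerpt: str = ""


@dataclass
class ConceptCluster:
    name: str
    description: str
    files: list[FileEntry] = field(default_factory=list)


@dataclass
class DiscoveryReport:
    query: str
    summary: str
    clusters: list[ConceptCluster] = field(default_factory=list)
    file_tree: str = ""

    def to_context(self) -> str:
        # Flatten the typed report into one bounded string; this is all the model sees.
        lines = [f"Query: {self.query}", f"Summary: {self.summary}", ""]
        for cluster in self.clusters:
            lines.append(f"## {cluster.name}: {cluster.description}")
            for entry in cluster.files:
                symbols = ", ".join(entry.key_symbols) or "-"
                lines.append(f"- {entry.path} [{entry.relevance.value}] {entry.role}: {symbols}")
        if self.file_tree:
            lines += ["", "File tree:", self.file_tree]
        return "\n".join(lines)
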
03

How the Real Benchmark Works

The rewritten benchmark in tests/benchmarks/bench_discover.py compares discover (one tool call via tinyagent adapter) against the legacy chain (list_dir -> glob -> grep -> read_file via separate tool calls).

Dimension         Measurement                          Why it matters
Latency           per-query cold/warm + p50/p95        Real responsiveness under setup overhead
Tool calls        calls per successful retrieval       Round-trip and orchestration burden
Output footprint  ceil(chars / 4) estimated tokens     Context cost and prompt pressure
File recall       hits on expected file set            Did we find the right files
Symbol recall     hits on expected symbol set          Did we return useful handles for edits
Actionability     top-rank hit or sufficient recall    Can the agent choose the next action immediately

Query set: 12 real TunaCode topics (compaction, cache manager, LSP diagnostics, model picker, tool registration, etc.). No synthetic fixtures.
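A hedged sketch of how those per-query metrics could be scored, assuming a hypothetical QueryResult record and an illustrative 0.5 recall threshold for actionability; the real harness in tests/benchmarks/bench_discover.py measures live tool runs rather than prebuilt records.

import math
from dataclasses import dataclass


@dataclass
class QueryResult:
    latency_ms: float          # wall-clock time for the whole retrieval
    tool_calls: int            # 1 for discover, ~6 for the legacy chain
    output_chars: int          # total characters returned to the model
    files_found: set[str]
    symbols_found: set[str]
    top_ranked_file: str


def score(result: QueryResult, expected_files: set[str], expected_symbols: set[str]) -> dict:
    # Token footprint is estimated as ceil(chars / 4), matching the benchmark's heuristic.
    tokens = math.ceil(result.output_chars / 4)
    file_recall = len(result.files_found & expected_files) / max(len(expected_files), 1)
    symbol_recall = len(result.symbols_found & expected_symbols) / max(len(expected_symbols), 1)
    # Actionability: the top-ranked file is correct, or recall is high enough to act on
    # (0.5 is an illustrative threshold, not the benchmark's exact rule).
    actionable = result.top_ranked_file in expected_files or file_recall >= 0.5
    return {
        "latency_ms": result.latency_ms,
        "tool_calls": result.tool_calls,
        "tokens": tokens,
        "file_recall": file_recall,
        "symbol_recall": symbol_recall,
        "actionable": actionable,
    }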

04

Before vs After: Control Flow

The classic glob -> grep -> read flow is flexible, but it pushes planning, branching, and synthesis work onto the model. That is exactly where less-SOTA models become inconsistent. Discover shifts that work into the tool.

Legacy (model-managed)

query -> list_dir -> glob -> grep -> read_file (N times) -> reconcile noisy outputs -> choose next file

Discover (tool-managed)

query -> discover -> typed report -> choose next file

Query interpretation
  Legacy chain:  Model invents patterns, retries search terms, decides file scopes
  Discover flow: Tool extracts terms and expands concepts with a fixed internal pipeline

Retrieval
  Legacy chain:  Multiple calls and manual narrowing across raw outputs
  Discover flow: Single call with bounded candidate collection and scoring

Context handoff
  Legacy chain:  Raw text from different tools, mixed granularity
  Discover flow: One typed report: clusters, symbols, roles, excerpts

Weaker model behavior
  Legacy chain:  Higher variance: more branch points and interpretation burden
  Discover flow: More deterministic: fewer branches, clearer next target
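To make the contrast concrete, a minimal sketch of the two call sequences from the agent's side, assuming a hypothetical run_tool dispatcher and placeholder paths; the real adapter layer and tool signatures differ.

import asyncio


async def run_tool(name: str, **kwargs) -> str:
    # Hypothetical stand-in for the agent's tool dispatcher.
    return f"<output of {name}{kwargs}>"


async def main() -> None:
    # Legacy chain: the model plans each step and reconciles raw output in between.
    await run_tool("list_dir", path="src/")
    await run_tool("glob", pattern="**/*compaction*")
    await run_tool("grep", pattern="compaction", path="src/")
    for candidate in ["src/some_module.py"]:   # placeholder; the model picks from noisy grep hits
        await run_tool("read_file", path=candidate)

    # Discover: one call returns the bounded, typed report as a single string.
    report = await run_tool("discover", query="how does context compaction work?")
    print(report)


asyncio.run(main())
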
Iterative Retrieval Flowchart (Mermaid)
flowchart TD
    A[Natural Language Query] --> B[Generate initial glob patterns + related terms]
    B --> C[Glob aggressively 3x to build prospect list]
    C --> D[Add all prospects to Queue]

    subgraph Iterative Processing
        E{Queue empty?} -->|No| F[Take next prospect]
        F --> G[Read first 200 lines]
        G --> H{Relevant based on current context?}
        H -->|Yes| I[Extract structured info + Update ontological map]
        I --> J[Discover new related files + Add them to Queue]
        J --> E
        H -->|No| K{Maybe later?}
        K -->|Yes, requeue| L[Push back to Queue with lower priority]
        L --> E
        K -->|No, discard| M[Discard prospect]
        M --> E
    end

    E -->|Yes| N[Consolidate ontological map]
    N --> O[Generate concise typed report for main agent]
    O --> P[Render Mermaid chart of ontology]

This is the core steering advantage: custom tools remove planning entropy. Less-capable models no longer need to invent a search strategy mid-flight; the tool runs a deterministic loop and hands back a single high-signal report for the model to act on.
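A compact sketch of that loop, with a placeholder relevance check and a simplified per-file record standing in for the real scoring, symbol extraction, and clustering.

import heapq
from pathlib import Path


def looks_relevant(text: str, query: str) -> bool:
    # Placeholder relevance check: any query term appears in the excerpt.
    return any(term in text.lower() for term in query.lower().split())


def iterative_discover(query: str, prospects: list[str], max_lines: int = 200) -> dict[str, dict]:
    # Seed the queue with glob-derived prospects; lower number = higher priority.
    queue = [(0, path) for path in prospects]
    heapq.heapify(queue)
    requeued: set[str] = set()
    ontology: dict[str, dict] = {}

    while queue:
        priority, path = heapq.heappop(queue)
        try:
            head_lines = Path(path).read_text(errors="ignore").splitlines()[:max_lines]
        except OSError:
            continue
        head = "\n".join(head_lines)

        if looks_relevant(head, query):
            # Record a simplified entry; the real tool extracts symbols, role, and imports,
            # and pushes newly discovered related files back onto the queue.
            ontology[path] = {"line_count": len(head_lines), "excerpt": head[:200]}
        elif path not in requeued:
            # "Maybe later": requeue once at lower priority in case context improves.
            requeued.add(path)
            heapq.heappush(queue, (priority + 1, path))
        # Otherwise: discard the prospect.

    return ontology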

05

Measured Outcomes

Saved baseline run from tests/benchmarks/bench_discover.py after the benchmark rewrite.

Mode   Latency            Calls        Tokens           File Recall    Symbol Recall   Actionable
Cold   2587 ms vs 2811    1.0 vs 6.0   3063 vs 12858    0.67 vs 0.46   0.50 vs 0.29    0.67 vs 0.50
Warm   2597 ms vs 2832    1.0 vs 6.0   3063 vs 12984    0.67 vs 0.40   0.50 vs 0.31    0.67 vs 0.42

Each cell reads discover vs legacy.

Interpretation

  • Discover cut round trips from 6 tool calls to 1
  • Context footprint reduced by about 4.2x
  • Faster despite richer processing (scoring, clustering, symbol extraction)
  • Improved retrieval quality on this query set

What this does not claim

  • Not a universal claim across all repositories
  • Not a claim that clustering alone drives wins
  • Not a replacement for task-level evals
06

Why Custom Tools Help Less-SOTA Models

Frontier models can brute-force noisy tool logs better than mid-tier models. Less-SOTA models are more sensitive to branchy search plans, noisy outputs, and ranking ambiguity. Custom tools narrow that gap by packaging retrieval into a higher-level primitive.

Failure mode (weaker model)                  Custom tool mitigation
Loses thread across 3–6 tool calls           Single-call report with summary + clusters + symbols
Wastes tokens on irrelevant raw output       Bounded report size and relevance filtering
Picks wrong file from weak ranking signals   Role labeling, symbol extraction, relevance tiers
Higher variance in planning quality          Stable output shape per query

The practical strategy is not “make the model smarter” first. It is “make the interface to code search less lossy” first.

07

Commit Timeline

92d32415
Rewrite benchmark on real repo
Replaced synthetic benchmark with end-to-end A/B harness
71b67edc
Hook exclusion for benchmark file length
Allows large benchmark harness to pass pre-commit
03b23121
Remove glob/grep/list_dir
Full cutover to discover-first workflow
bd441d7e
Add discover tool + tests + prompt
Introduced unified discovery primitive