How Do We Know Agents Pick the Right Tool?

The Question

An agent reads a system prompt, reads a user message, and picks a tool to call. Sometimes it picks report_issue when the user wanted a feature request filed on the board. Sometimes it calls web_search when the answer was already in memory. How do we know, at any point in time, that the agents we ship are making the right calls for the instructions they were given?

What We Have Today

Nothing automated. Tool-call correctness is verified by two things: developers running scenarios locally while iterating on prompts, and users hitting edge cases in production and telling us when something felt wrong. Session transcripts are archived, so we can go back and inspect a bad interaction after the fact — but only if someone flags it.

This is honest UAT. It catches obvious failures. It does not catch drift, does not cover scenarios nobody thought to test, and gets more expensive every time we add a tool or tweak a prompt.

What Upstream Has

The same. Neither the upstream framework (agent-purple-shovel/agent-system) nor the Amber fork ships an eval harness for tool-call correctness. There is a skill-creator/eval-viewer for scoring skill trigger descriptions, but that’s for authoring skills, not for verifying an agent’s tool choice in a live conversation.

The Data Is Already There

This is the part worth sitting with. Every agent turn is already persisted: the system prompt in effect, the user message, the tools available, the tool the agent actually called, and the result. The session-archive skill surfaces this and we’ve been capturing it for months. We don’t need new telemetry, new hooks, or new instrumentation to start studying tool-call quality — we need a reader that turns what’s already on disk into a labelled dataset.
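
A minimal sketch of such a reader, in Python. It assumes (the archive format is not confirmed here) that each turn is stored as one JSON object per line, with the hypothetical fields system_prompt, user_message, tools_available, tool_called, and an optional reviewer-supplied label. If the real session-archive layout differs, only the field mapping should need to change.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class TurnRecord:
    system_prompt: str
    user_message: str
    tools_available: Tuple[str, ...]
    tool_called: Optional[str]    # None when the agent answered without a tool
    label: Optional[str] = None   # e.g. "good" / "bad"; None until someone reviews the turn


def read_turns(lines: Iterable[str]) -> Iterator[TurnRecord]:
    """Yield one TurnRecord per well-formed JSONL line; skip blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            raw = json.loads(line)
            record = TurnRecord(
                system_prompt=raw["system_prompt"],
                user_message=raw["user_message"],
                tools_available=tuple(raw["tools_available"]),
                tool_called=raw.get("tool_called"),
                label=raw.get("label"),
            )
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # a production reader would log these rather than drop them silently
        yield record


def labelled_only(records: Iterable[TurnRecord]) -> Iterator[TurnRecord]:
    """Keep only the turns a human has already marked good or bad."""
    return (r for r in records if r.label is not None)
```

Passing an open file handle to read_turns streams the archive one line at a time, so the whole corpus never has to fit in memory. Turns without a label are kept by read_turns on purpose: they are the pool a judge model or a reviewer would work through later.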

That changes the shape of the problem. It isn’t “build an eval system,” it’s “score the corpus we already have and decide what to do with the score.”

Two Directions

Offline evals. A fixture set of (system prompt, user message) → expected tool(s) cases, run on every prompt or model change. Deterministic and cheap. Only covers scenarios someone thought to write down, but catches regressions fast. Seed the fixtures from real archived sessions where the outcome was known-good or known-bad — no hand-authoring from scratch.
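
The fixture idea can be sketched in a few lines of Python. Everything named here is illustrative: Fixture, run_fixtures, and stub_agent are hypothetical, create_feature_request is an invented tool name, and choose_tool stands in for whatever call actually drives the agent. Against a real model, the run is only repeatable if sampling settings are pinned.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Sequence


@dataclass(frozen=True)
class Fixture:
    name: str
    system_prompt: str
    user_message: str
    expected_tools: FrozenSet[str]  # any tool in this set counts as a correct choice


# Stand-in for the real agent call: (system_prompt, user_message) -> tool name.
ChooseTool = Callable[[str, str], str]


def run_fixtures(fixtures: Sequence[Fixture], choose_tool: ChooseTool) -> List[str]:
    """Run every fixture and return one failure message per mismatch."""
    failures: List[str] = []
    for fx in fixtures:
        actual = choose_tool(fx.system_prompt, fx.user_message)
        if actual not in fx.expected_tools:
            failures.append(
                f"{fx.name}: expected one of {sorted(fx.expected_tools)}, got {actual!r}"
            )
    return failures


def stub_agent(system_prompt: str, user_message: str) -> str:
    """Toy keyword router used only to exercise the harness; not a real agent."""
    return "report_issue" if "bug" in user_message.lower() else "web_search"


FIXTURES = [
    Fixture("bug-report", "You are a support agent.", "I found a bug in export.",
            frozenset({"report_issue"})),
    Fixture("feature-request", "You are a support agent.", "Please add dark mode.",
            frozenset({"create_feature_request"})),  # hypothetical tool name
]

if __name__ == "__main__":
    for message in run_fixtures(FIXTURES, stub_agent):
        print(message)
```

With the toy router, the bug-report fixture passes and the feature-request fixture fails, which is the kind of signal a CI job would turn into a red build. Storing expected tools as a set rather than a single name leaves room for cases where more than one choice is acceptable.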

LLM-as-judge over real transcripts. Sample live sessions from the archive and have a judge model score whether the tool call matched the user’s stated intent. Catches drift that fixtures miss. Costs tokens per evaluation, and the judge itself can be wrong — calibration is its own problem. Feasible today because the transcripts are already there.
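
A rough Python sketch of the sampling and scoring loop, including the calibration check mentioned above. The judge is modelled as a plain callable so that any model client could be plugged in. The function names, the dictionary keys, and stub_judge are all assumptions made for illustration; no real judge prompt or model API is implied.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Turn = Dict[str, str]            # assumed keys: "user_message", "tool_called"
Judge = Callable[[Turn], bool]   # True if the tool call matched the user's stated intent


def sample_turns(turns: Sequence[Turn], k: int, seed: int = 0) -> List[Turn]:
    """Draw up to k turns without replacement; a fixed seed keeps the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(turns), min(k, len(turns)))


def match_rate(turns: Sequence[Turn], judge: Judge) -> float:
    """Fraction of turns the judge accepts."""
    if not turns:
        raise ValueError("no turns to score")
    return sum(1 for t in turns if judge(t)) / len(turns)


def judge_agreement(labelled: Sequence[Tuple[Turn, bool]], judge: Judge) -> float:
    """Fraction of human-labelled turns where the judge gives the same verdict."""
    if not labelled:
        raise ValueError("no labelled turns to calibrate against")
    return sum(1 for turn, human_ok in labelled if judge(turn) == human_ok) / len(labelled)


def stub_judge(turn: Turn) -> bool:
    """Placeholder for a model call: accepts web_search only for messages that ask to search."""
    if turn["tool_called"] == "web_search":
        return "search" in turn["user_message"].lower()
    return True
```

Tracking match_rate over successive samples is what would surface drift. judge_agreement is the guard on the judge itself: if it disagrees with human labels too often on a small reviewed set, its scores on the unreviewed sample should not be trusted yet.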

Why This Is Getting More Urgent

Today we have a small number of agents and a small number of users, so UAT keeps up. Every new tool, every prompt tweak, every model upgrade widens the surface. The cost of a silently wrong tool call is paid by the user, not the developer, and we only hear about the ones people bother to report. At some point “developers catch it” stops being a credible story.

What We’re Asking

Is this worth building now, or is the honest answer to keep it on the backlog until we feel the pain more? Given the data is already sitting in the session archive, the marginal cost of a first pass is a scoring script, not a new system. Where should the harness live — inside memory-mcp, as a separate eval service, or as a CI job running against recorded fixtures? And should we push it upstream, or fork it into Vanilla and carry it as a deviation?