Agent evals: how to measure tool calls, trajectories, and production reality

Agent evals need to grade tool calls, trajectories, and production behaviour — not just final outputs. Here's how each layer works and why most teams miss it.

How do you know your agent is actually working? Not in a demo. In production, on real customer data, against the messy edges your QA suite never imagined. That's the agent evals question, and most teams answer it badly — usually because they're still grading the model when they should be grading the system.

Our view: agent evals are not LLM evals with extra steps. They're a measurement discipline that has to cover three layers at once — the final output, the trajectory that produced it, and the individual tool calls along the way. Most teams measure the first, ignore the second, and trust the third. That's why agents pass eval suites and still fail in production.

This piece covers what agent evals actually are, why output-only grading misses the point, how to evaluate tool call correctness, how to score trajectories without overfitting to a script, and what production reality demands that offline benchmarks can't give you.

What agent evals are actually measuring

An LLM eval grades a model's response to a prompt. Input goes in, text comes out, you score the text. The contract is small and the surface area is the model.

An agent eval grades a system that decides, acts, and adjusts. The model is one component. The tools the agent invokes are another. The orchestration layer that decides when to call what is a third. The auth context that determines what the agent can see is a fourth. A bad answer can come from any of these, and the failure mode looks identical from the outside: the agent did the wrong thing.

This is why AI agent evaluation needs to decompose. You're not measuring "is the answer good." You're measuring the three layers above plus how the system behaves operationally. That makes four things, roughly in this order:

  1. Tool call correctness — did the agent call the right tool, with the right arguments, at the right time?
  2. Trajectory quality — was the path from prompt to answer reasonable, or did it loop, backtrack, or take a five-step route through a two-step problem?
  3. Final output quality — is the answer correct, complete, and in the format the caller expected?
  4. Operational behaviour — latency, cost, retries, partial failures, and what happened when something went wrong.

Most eval frameworks bias hard toward layer 3. That's where the public benchmarks live and where the existing LLM evaluation tooling already works. It's also the layer that tells you least about why your agent stalled.

Why output-only grading misses the point

Grading only the final output has a structural problem: two completely different runs can produce the same answer. One agent might call the right tool once and return cleanly. Another might call the wrong tool three times, hallucinate a partial result, then accidentally land on the right answer through retries. Same output. Wildly different production behaviour.

The second agent will blow up the first time the wrong tool returns an error the retry logic doesn't handle. With three wasted calls before the right one, it will also cost several times as much per request and add seconds of latency. None of this shows up if your eval only looks at the last message.
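
To make the gap concrete, here is a minimal sketch in Python. The `Run` record, the tool names, and both graders are hypothetical, invented for illustration. It only shows that a grader looking at the final message cannot tell the two runs described above apart.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical record of one agent run."""
    tool_calls: list[str] = field(default_factory=list)  # tool names, in call order
    final_answer: str = ""


def grade_output(run: Run, expected_answer: str) -> float:
    # Output-only grading: inspects the last message and nothing else.
    return 1.0 if run.final_answer == expected_answer else 0.0


def wasted_calls(run: Run, expected_tools: set[str]) -> int:
    # Counts calls to tools outside the labelled set for this task.
    return sum(1 for name in run.tool_calls if name not in expected_tools)


# One clean run, and one that reached the same answer via three wrong calls.
clean = Run(["get_customer_by_id"], "Plan: enterprise")
lucky = Run(["search_customers"] * 3 + ["get_customer_by_id"], "Plan: enterprise")

for run in (clean, lucky):
    print(grade_output(run, "Plan: enterprise"), wasted_calls(run, {"get_customer_by_id"}))
```

Both runs get the same output score. Only the second number separates them, and that number never appears in an output-only report.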

There's a second problem, more subtle. Output-only grading rewards plausibility. LLMs are good at sounding correct. An agent that confidently invents a customer ID and returns a confident-sounding answer will often score higher on output grading than an agent that correctly returns "I couldn't find that customer." The second agent is the one you want in production. The first one is the one that gets you on the front page of Hacker News for the wrong reasons.

The fix isn't to abandon output grading. It's to stop treating it as the whole eval. Output grading tells you whether the answer was good. Tool call grading and trajectory grading tell you whether the answer was earned. You need both.

Evaluating tool calls: the four checks that matter

Tool call correctness is the layer most teams underinvest in, partly because existing AI observability stacks, even the good ones, weren't built to grade tool calls specifically. We covered the broader split between LLM observability tools and agent tracing platforms elsewhere; here the focus is narrower. What does it actually mean to grade a tool call?

Four checks, in order of how much they tell you:

1. Tool selection. Did the agent pick the right tool for the step? This is the highest-signal check, because picking the wrong tool is almost always a downstream-fatal error. If the agent calls search_customers when it should have called get_customer_by_id, no amount of clever argument handling will save it. Grade tool selection on a per-step basis, against a labelled set of expected tools per step.

2. Argument correctness. Given the right tool, were the arguments right? This splits into two: were the required arguments present and well-typed, and were the values correct? A tool call that passes { "customer_id": "acme-corp" } when the actual ID is cust_8472 is syntactically fine and semantically broken. Argument grading needs both a schema check (cheap, deterministic) and a value check (harder, often needs LLM-as-judge or a ground-truth dataset).

3. Argument grounding. Where did the argument values come from? In a well-behaved trajectory, every argument value either comes from the user's prompt, from a previous tool call's result, or from a documented default. If you can't trace an argument back to one of those sources, the agent invented it. Hallucinated arguments are one of the most common silent failure modes in production agents, and they're invisible to output-only grading.

4. Result handling. Did the agent correctly interpret what the tool returned? A tool that returns { "status": "not_found" } should not produce an agent response that says "I found the customer." This sounds obvious. It happens constantly.
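
The first two checks can be sketched as follows. This is a rough illustration, not a reference implementation: the `ToolCall` shape, the flat `required` type map, and the ground-truth dictionary are all assumptions, and a real catalogue would more likely validate against its own schema format.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical shape of one recorded tool call."""
    name: str
    args: dict[str, Any]


def tool_selection_score(actual: list[ToolCall], expected_tools: list[str]) -> float:
    """Fraction of labelled steps where the agent picked the expected tool."""
    if not expected_tools:
        return 1.0
    hits = sum(
        1
        for i, want in enumerate(expected_tools)
        if i < len(actual) and actual[i].name == want
    )
    return hits / len(expected_tools)


def schema_errors(call: ToolCall, required: dict[str, type]) -> list[str]:
    """Cheap, deterministic check: required arguments present and well-typed."""
    errors = []
    for arg, expected_type in required.items():
        if arg not in call.args:
            errors.append(f"missing: {arg}")
        elif not isinstance(call.args[arg], expected_type):
            errors.append(f"wrong type: {arg}")
    return errors


def value_errors(call: ToolCall, ground_truth: dict[str, Any]) -> list[str]:
    """Value check against labelled ground truth (assumes labels exist)."""
    return [
        f"wrong value: {arg}"
        for arg, want in ground_truth.items()
        if call.args.get(arg) != want
    ]


call = ToolCall("get_customer_by_id", {"customer_id": "acme-corp"})
print(schema_errors(call, {"customer_id": str}))         # passes the schema check
print(value_errors(call, {"customer_id": "cust_8472"}))  # fails the value check
```

The example call is the one from the text: well-typed, so the schema check is silent, and wrong, so only the value check flags it.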

These four checks compose. An agent can pick the right tool, pass the right arguments, ground them properly, and still mishandle the result. Each check is independent signal.
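
As a rough illustration of the third and fourth checks, the sketch below traces argument values back to their sources and flags a response that contradicts a tool result. Both functions are hypothetical and deliberately naive: grounding uses substring matching, which will produce false positives on short values, and the result-handling check is a keyword heuristic standing in for a stronger grader.

```python
import json
from typing import Any


def ungrounded_args(
    args: dict[str, Any],
    prompt: str,
    prior_results: list[dict],
    defaults: dict[str, Any],
) -> list[str]:
    """Return argument names whose values cannot be traced to a known source."""
    # Naive approach: search the prompt and the serialized prior tool results.
    haystack = prompt + " " + " ".join(json.dumps(r) for r in prior_results)
    untraced = []
    for name, value in args.items():
        if name in defaults and defaults[name] == value:
            continue  # documented default
        if str(value) in haystack:
            continue  # appears in the prompt or an earlier tool result
        untraced.append(name)
    return untraced


def contradicts_not_found(tool_result: dict, response: str) -> bool:
    """Flag a response that claims success after the tool reported not_found."""
    claims_found = "i found" in response.lower()
    return tool_result.get("status") == "not_found" and claims_found


args = {"customer_id": "cust_8472"}
# No earlier result contains the ID, so it counts as invented.
print(ungrounded_args(args, "Look up Acme Corp", [], {}))
# Once a prior search returned it, the same argument is grounded.
print(ungrounded_args(args, "Look up Acme Corp", [{"id": "cust_8472"}], {}))
print(contradicts_not_found({"status": "not_found"}, "I found the customer."))
```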

The related question — what counts as "the right tool" — is harder than it looks when your tool catalogue is large. We've written about tool calling vs function calling and why the same mechanism produces different production realities depending on how the tool surface is designed. The short version: if your tools overlap in capability, your agent will pick the wrong one, and your evals need to catch that.

Agent trajectory evaluation

A trajectory is the sequence of steps an agent took to get from the initial prompt to the final response. For a five-tool-call task, the trajectory is the ordered list of (thought, tool call, result) tuples. Evaluating trajectories means asking: was this a good path?

There are two ways to grade trajectories, and most teams pick the wrong one first.

Exact-match trajectory grading

You define a reference trajectory — the exact sequence of tool calls a perfect agent would make — and you score how closely the actual trajectory matches. This is appealing because it's deterministic, easy to score, and easy to explain to stakeholders.

It's also brittle. Real agents take valid alternative paths constantly. If the reference says "call list_orders then filter" and the agent calls search_orders with a filter argument, exact-match grading penalises a behaviour that was arguably better. You end up evaluating compliance with a script, not problem-solving.

Exact-match grading works for narrow, deterministic tasks where there genuinely is one correct path — think single-purpose tool use, regulated workflows, or compliance checks where deviating from the script is itself a failure. For everything else it overfits.
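
The brittleness is easy to reproduce. In the sketch below, trajectories are reduced to lists of tool names, and `filter_orders` is a hypothetical tool standing in for the "then filter" step. Both scoring functions are illustrative assumptions, not a standard metric.

```python
def exact_match(actual: list[str], reference: list[str]) -> bool:
    """True only if the agent made exactly the reference calls, in order."""
    return actual == reference


def positional_match(actual: list[str], reference: list[str]) -> float:
    """Fraction of reference steps matched at the same position."""
    if not reference:
        return 1.0
    hits = sum(1 for a, r in zip(actual, reference) if a == r)
    return hits / len(reference)


reference = ["list_orders", "filter_orders"]
scripted = ["list_orders", "filter_orders"]
alternative = ["search_orders"]  # one call with a filter argument

print(exact_match(scripted, reference), positional_match(scripted, reference))
print(exact_match(alternative, reference), positional_match(alternative, reference))
```

The alternative path scores zero on both measures even though it solved the task in fewer calls. That is the right verdict in a regulated workflow and the wrong one almost everywhere else.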

Trajectory quality grading

The alternative is to grade the trajectory on properties, not exact match. Useful properties:

  • Step efficiency. How many tool calls did the trajectory take vs the minimum it could have taken? A trajectory that took 7 calls when 3 would do isn't broken, but it's wasteful, and it's a leading indicator of agents that will time out or burn budget in production.
  • Loop detection. Did the agent call the same tool with the same arguments twice? Three times? Loops are nearly always a sign that the agent is confused and the tool isn't returning what the agent expects.
  • Backtrack count. How often did the agent abandon a partial result and start over? Some backtracking is healthy. A lot of it means the agent is guessing.
  • Reasoning coherence. Did each step's reasoning follow from the previous step's result? This typically needs LLM-as-judge grading, with the judge itself checked against a labelled set.
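
The first two properties are cheap to compute. A minimal sketch, assuming each eval case carries a labelled minimum call count and that tool calls are recorded as (name, arguments) pairs; both function names are hypothetical:

```python
import json
from collections import Counter


def step_efficiency(num_calls: int, min_calls: int) -> float:
    """Ratio of the labelled minimum to the calls actually taken; 1.0 is optimal."""
    if num_calls <= 0:
        raise ValueError("num_calls must be positive")
    return min(1.0, min_calls / num_calls)


def repeated_calls(calls: list[tuple[str, dict]]) -> dict[str, int]:
    """Map each (tool, arguments) signature seen more than once to its count."""
    # Serialize arguments with sorted keys so key order does not hide a repeat.
    counts = Counter(
        f"{name}:{json.dumps(args, sort_keys=True)}" for name, args in calls
    )
    return {signature: n for signature, n in counts.items() if n > 1}


calls = [("search_customers", {"q": "acme"})] * 3 + [
    ("get_customer_by_id", {"customer_id": "cust_8472"})
]
print(step_efficiency(num_calls=7, min_calls=3))
print(repeated_calls(calls))
```

Backtrack count and reasoning coherence depend on how your agent records its intermediate state, so they are left out of this sketch.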

Property-based trajectory grading is more work to set up than exact-match. It also degrades more gracefully when your agent surface changes — adding a new tool doesn't invalidate your eval set, it just changes what "efficient" looks like.

The broader pattern here connects to how agentic workflows actually break in production — and trajectory evaluation is one of the few ways to catch breakage before it surfaces as a customer complaint.

What production reality demands

Offline eval suites run on canned data, with stubbed tools, in a controlled environment. They're necessary. They're not sufficient.

Production agents fail on things that don't appear in offline evals: third-party API rate limits, auth tokens that expire mid-trajectory, tools that return malformed responses, tools whose contracts changed in a way nobody told the agent team about, customer data that doesn't match the shape the agent was built to expect. None of this shows up in an eval suite running against mocked tool responses.

This means a few things in practice:

Production grading has to be continuous. A weekly eval run against a fixed test set tells you about regressions in known behaviour. It tells you nothing about the long tail of failures that show up in real customer traffic. You need sampling-based grading on live trajectories — pull a representative slice of real production runs, grade them on the four tool-call checks and the trajectory properties above, and watch the trends.
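
One way to pull that slice is deterministic sampling on a run identifier, so the same run is always in or out of the sample no matter which worker sees it. This is a sketch under assumptions: `run_id` is a hypothetical stable identifier for each production run, and uniform sampling may need stratifying (by tool or customer segment, say) before it is truly representative.

```python
import hashlib


def in_sample(run_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of runs by hashing the run ID."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


run_ids = [f"run-{i}" for i in range(10_000)]
sampled = [run_id for run_id in run_ids if in_sample(run_id, 0.10)]
print(len(sampled))  # close to 1,000, and identical on every execution
```

Hashing rather than calling a random number generator means a run can be re-graded later without having stored the sampling decision.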

Tool contract stability matters as much as eval coverage. If the tools your agent depends on are changing underneath you, your eval scores will move and you won't know why. This is one of the reasons we've written about API breaking changes and how to detect them — silent tool drift is one of the largest categories of "my agent used to work and now it doesn't" failures, and conventional evals don't catch it.
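
A cheap first line of defence is to fingerprint each tool's declared schema and compare it against a stored baseline before every eval run. The sketch below assumes schemas are available as JSON-serializable dictionaries; the function names are hypothetical. It only catches changes to the declared contract, not a tool whose behaviour changed behind an unchanged schema.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Stable hash of a schema, independent of key order and whitespace."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_tools(baseline: dict[str, str], current: dict[str, dict]) -> list[str]:
    """Tools that are new or whose fingerprint differs from the baseline."""
    return sorted(
        name
        for name, schema in current.items()
        if baseline.get(name) != schema_fingerprint(schema)
    )


old_schema = {"type": "object", "properties": {"customer_id": {"type": "string"}}}
new_schema = {"type": "object", "properties": {"customer_id": {"type": "integer"}}}

baseline = {"get_customer_by_id": schema_fingerprint(old_schema)}
print(drifted_tools(baseline, {"get_customer_by_id": old_schema}))
print(drifted_tools(baseline, {"get_customer_by_id": new_schema}))
```

Recording the drift list alongside each eval run gives you something to correlate against when scores move.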

Auth context has to be in the eval. An agent that works fine with an admin token and fails for users with restricted permissions is broken, but offline evals usually run with a single test identity and miss the entire class of failure. Production grading needs to cover the actual permission scopes your agents run under.
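
In practice this means running each eval case once per permission scope, with a per-scope expectation. The sketch below is illustrative only: the scope names, the case format, and the deliberately broken stub agent are all hypothetical, and exact string comparison stands in for whatever output grader you actually use.

```python
from itertools import product
from typing import Callable


def run_matrix(
    cases: list[dict],
    scopes: list[str],
    run_agent: Callable[[str, str], str],
) -> dict[tuple[str, str], bool]:
    """Run every case under every scope; return pass/fail keyed by (case, scope)."""
    results = {}
    for case, scope in product(cases, scopes):
        expected = case["expected"][scope]
        results[(case["id"], scope)] = run_agent(case["prompt"], scope) == expected
    return results


def scope_blind_agent(prompt: str, scope: str) -> str:
    # Deliberately broken stub: answers as if every caller were an admin.
    return "Invoice total: 4200"


cases = [
    {
        "id": "invoice-lookup",
        "prompt": "What is the latest invoice total for cust_8472?",
        "expected": {
            "admin": "Invoice total: 4200",
            "support_readonly": "I don't have access to billing data.",
        },
    }
]

results = run_matrix(cases, ["admin", "support_readonly"], scope_blind_agent)
for key, passed in results.items():
    print(key, passed)
```

A suite that only ran the admin column would report this agent as passing.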

Failure modes need their own grading. What does the agent do when a tool returns an error? When a tool times out? When the auth token has expired? These are deterministic test cases — you can inject them — and they're often the difference between an agent that degrades gracefully and one that loops or hallucinates. According to the 2025 MuleSoft Connectivity Benchmark Report, IT teams spend 39% of their time designing, building, and testing custom integrations — one signal that integration surfaces don't stabilise on their own; agent failure-mode evals are how you stay ahead of that drift.
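
Because these cases are deterministic, they fit in ordinary unit tests. A minimal sketch of timeout injection follows; `FlakyTool` is a hypothetical stub, and `call_with_retries` is a stand-in for whatever retry logic your agent actually uses.

```python
class FlakyTool:
    """Hypothetical stub that times out on its first `failures` calls."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def __call__(self, **kwargs) -> dict:
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("injected timeout")
        return {"status": "ok", "args": kwargs}


def call_with_retries(tool, max_attempts: int, **kwargs) -> dict:
    """Stand-in for the retry logic under test: bounded retries, explicit failure."""
    for _ in range(max_attempts):
        try:
            return tool(**kwargs)
        except TimeoutError:
            continue
    return {"status": "failed", "reason": "timeout"}


recovers = FlakyTool(failures=2)
print(call_with_retries(recovers, 3, customer_id="cust_8472")["status"], recovers.calls)

never_recovers = FlakyTool(failures=10)
print(call_with_retries(never_recovers, 3, customer_id="cust_8472"), never_recovers.calls)
```

The two properties worth asserting are that the call count stays bounded and that the failure is reported explicitly instead of being papered over with an invented result.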

How Pontil fits

Most of what makes agent evals hard isn't the grading methodology. It's the surface you're grading against.

If your tools are bespoke wrappers around a handful of APIs, with hand-written argument schemas and ad-hoc error handling, your evals end up measuring two things at once — the agent's behaviour and the tool layer's stability. When eval scores drop, you can't tell which moved. That's the real reason agent projects stall at the production gate.

Pontil sits in the tools layer of the agent stack. We generate connectors directly from your existing APIs, run them on a managed runtime with consistent error handling and per-user auth, and keep them current as your products change. The grading benefit is concrete: tool schemas are stable, argument grounding is traceable, auth context flows through to every call, and failure modes are uniform across the tool catalogue. Your evals get to measure the agent, not the integration surface underneath it. If that's the shape of the problem you're hitting, the Pontil product page covers how the runtime works in more detail.

What does a useful eval pipeline look like in six months?

The teams getting agent evals right today are building toward something specific: continuous, layered grading that runs on real production trajectories, decomposes failures by tool/trajectory/output, and treats the tool layer as an evaluable surface in its own right rather than an opaque dependency.
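
Decomposing failures can be as simple as attributing each failed run to the earliest layer that failed. The sketch below assumes every graded run already carries a pass/fail verdict per layer. The layer names and the earliest-layer attribution rule are our illustrative choices, not a standard.

```python
from collections import Counter
from typing import Optional

# Checked in this order: a bad tool call usually explains what follows it.
LAYERS = ("tool", "trajectory", "output")


def first_failing_layer(grades: dict[str, bool]) -> Optional[str]:
    """Return the earliest layer that failed, or None if every layer passed."""
    for layer in LAYERS:
        if not grades[layer]:
            return layer
    return None


def failure_breakdown(graded_runs: list[dict[str, bool]]) -> Counter:
    """Count failed runs by the earliest layer that failed."""
    layers = (first_failing_layer(grades) for grades in graded_runs)
    return Counter(layer for layer in layers if layer is not None)


graded_runs = [
    {"tool": True, "trajectory": True, "output": True},
    {"tool": False, "trajectory": False, "output": True},  # right answer, wrong tool
    {"tool": True, "trajectory": False, "output": False},
    {"tool": True, "trajectory": True, "output": False},
]
breakdown = failure_breakdown(graded_runs)
print(breakdown)
```

Note the second run: an output-only suite would count it as a pass, while the breakdown files it under tool failures.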

That's a different shape from the eval suites most teams are running now — which tend to be batch jobs against fixed prompts, grading final outputs with LLM-as-judge, run weekly and reviewed during the agent team's standup. Those suites aren't wrong. They're just measuring the layer that fails least.

The agents that ship and stay shipped will be the ones whose teams measured the tools, the trajectories, and the production behaviour — not just the model's last word. If your evals don't currently cover all three, the gap between your offline scores and your production behaviour is going to keep growing. Closing it is mostly a question of where you put the next month of measurement work.
