AI agent observability: LLM observability tools vs agent tracing platforms

AI agent observability compared: LLM observability tools vs agent tracing platforms. When each wins, where they fail, and what production agents actually need.

Your agent ran a customer refund workflow at 3am. Something went wrong. The user got two refunds. The support ticket lands on your desk at 9am, and the only artifacts you have are a chat transcript and a prompt that says "refund the customer per policy."

This is the AI agent observability problem. Logs tell you the model returned text. They don't tell you which tool fired, what arguments it sent, whether a retry issued the refund twice, or which user identity carried the call. Two categories of product now compete to fix this: LLM observability tools (extended from the prompt-and-response world) and agent tracing platforms (built around the tool-call graph). They overlap, and increasingly converge, but they are not the same.

This comparison is for Heads of AI and engineering leads picking one — or, more often, picking one knowing they'll need both. The short version: LLM observability wins when your problem is model behaviour. Agent tracing wins when your problem is what happened after the model decided.

How LLM observability tools work

LLM observability grew out of the prompt-engineering era. The unit of observation is the model call: prompt in, completion out, tokens counted, latency measured, cost attributed. Vendors like Langfuse, Helicone, Arize (which began in classical ML monitoring before expanding into LLMs), and LangSmith all built their early products around this shape. The instrumentation is usually a wrapper around the foundation model SDK, or a proxy that intercepts API calls to providers such as OpenAI and Anthropic.

What you get is rich data about the model itself. Token usage broken down by prompt and completion. Latency percentiles per model version. Cost per request, per user, per feature. Quality scoring through automated evals or human annotation. Side-by-side prompt diffs when you change a system message. For teams shipping LLM features — chat, summarisation, classification — this is the right shape of data.
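To make the model-call shape concrete, here is a minimal Python sketch of an SDK-style wrapper. Everything in it is an assumption for illustration: `fake_model` stands in for a real model SDK, the prices in `PRICE_PER_1K` are invented, and `observed_call` and `cost_by` are hypothetical names rather than any vendor's API.

```python
import time
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; real prices vary by model and vendor.
PRICE_PER_1K = {"prompt": 0.003, "completion": 0.015}


@dataclass
class ModelCallRecord:
    """One observation per model call: who, which feature, tokens, latency."""
    user_id: str
    feature: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float

    @property
    def cost_usd(self) -> float:
        return (self.prompt_tokens * PRICE_PER_1K["prompt"]
                + self.completion_tokens * PRICE_PER_1K["completion"]) / 1000


def fake_model(prompt: str) -> dict:
    """Stand-in for a model SDK call; counts words as a crude token proxy."""
    return {"text": "ok", "prompt_tokens": len(prompt.split()), "completion_tokens": 1}


records: list[ModelCallRecord] = []


def observed_call(prompt: str, user_id: str, feature: str) -> str:
    """Wrap the model call and append one record describing it."""
    start = time.perf_counter()
    response = fake_model(prompt)
    latency = time.perf_counter() - start
    records.append(ModelCallRecord(user_id, feature, response["prompt_tokens"],
                                   response["completion_tokens"], latency))
    return response["text"]


def cost_by(key: str) -> dict[str, float]:
    """Sum cost grouped by a record attribute such as 'user_id' or 'feature'."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost_usd
    return dict(totals)


observed_call("summarise this ticket please", user_id="u1", feature="summarise")
observed_call("classify this email", user_id="u2", feature="classify")
observed_call("summarise another ticket", user_id="u1", feature="summarise")
print(cost_by("feature"))
print(cost_by("user_id"))
```

Note what the record contains and what it does not: tokens, latency, and cost are all there, but nothing describes what a tool did after the model returned.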

The limit shows up the moment your model starts calling tools. An LLM observability trace will record that the model emitted a tool-use block with certain arguments. What happened next — whether the tool succeeded, what the API returned, whether it retried, what user identity ran it, what state changed in the underlying SaaS product — sits outside the model's view unless the orchestrator pushes that data back in. Most of these vendors now offer hierarchical traces and tool-call views, but the depth of instrumentation at the tool runtime is where the categories still diverge.
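The gap described above can be sketched as a simple join. In this illustrative Python example, `ToolUse` is what a model-level trace sees and `ToolResult` is what the tool runtime would report if the orchestrator recorded it. Both types, their fields, and the `unresolved` helper are hypothetical names chosen for this sketch, not a vendor schema.

```python
from dataclasses import dataclass


@dataclass
class ToolUse:
    """A tool-use block as emitted by the model."""
    tool_use_id: str
    name: str
    arguments: dict


@dataclass
class ToolResult:
    """What the tool runtime reports back, if anyone records it."""
    tool_use_id: str
    status: str  # "ok" or "error"
    attempts: int


def unresolved(tool_uses, tool_results):
    """Return tool-use blocks that have no recorded outcome."""
    seen = {result.tool_use_id for result in tool_results}
    return [use for use in tool_uses if use.tool_use_id not in seen]


uses = [
    ToolUse("t1", "refund_customer", {"order_id": "A-100", "amount": 25.0}),
    ToolUse("t2", "send_email", {"to": "customer@example.com"}),
]
# Only the refund reported back; the email's outcome was never captured.
results = [ToolResult("t1", "ok", attempts=2)]

for use in unresolved(uses, results):
    print(f"no outcome recorded for {use.name} ({use.tool_use_id})")
```

A check like this is only possible when results are keyed back to the tool-use block that caused them, which is the data a model-only trace tends to lack.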

How agent tracing platforms work

Agent tracing platforms treat the tool-call graph as the primary object. The unit isn't a model call — it's a span tree representing the full run: the user's request, the model's reasoning steps, every tool invocation, every retry, every nested sub-agent call. Vendors and open standards here include OpenTelemetry's GenAI semantic conventions, Braintrust (which explicitly combines evals and tracing in one product), LangSmith — now positioned around full agent trace trees and LangGraph integration — and runtime offerings inside frameworks like LangGraph and CrewAI.

Instrumentation looks different. Instead of wrapping the model SDK, agent tracing wraps the orchestrator or the tool runtime. Every tool call gets a span. Every span carries the user identity that authorised it, the arguments passed, the response returned, the error if any, and the parent-child relationships that show how one tool call led to another. When the model decides to call refund_customer and that triggers update_billing_ledger and send_email, you see all three in one trace, in order, with timing.

The payoff is operational. When something goes wrong in production — a duplicate refund, a wrong-customer update, a stuck workflow — you can answer the questions that actually matter. Which tool fired. With what arguments. Under whose identity. Did it succeed, retry, or fail silently. Eval depth varies: some platforms in this category delegate evals to a separate tool, while others (Braintrust being the clearest example) treat evals as a first-class part of the same workflow.

How they compare

It's worth saying up front: the two categories are converging. Most LLM observability vendors now ship agent tracing features, and some agent tracing platforms ship eval workflows. The distinction below is about architectural origin and where each category goes deepest — not a clean capability split.

| | LLM observability tools | Agent tracing platforms |
| --- | --- | --- |
| Primary unit | Model call (prompt + completion) | Tool-call graph (span tree per run) |
| Best for | Prompt iteration, eval, cost and token tracking | Production debugging, tool-call audit, multi-step runs |
| Instrumentation point | Model SDK wrapper or proxy | Orchestrator or tool runtime |
| User identity per call | Supported as a tag, must be passed by the app | Supported, more often propagated automatically across nested spans |
| Tool-call coverage | Records the model's tool-use block; deeper tool-runtime data depends on the orchestrator | Full execution, retries, nested calls, results |
| Eval and quality scoring | Strong; built around prompts and completions | Varies; some delegate to a separate tool, others (e.g. Braintrust) bundle evals |
| Cost attribution | Per model call, per user, per feature | Per run end-to-end, including tool-side cost |
| Production debugging | Useful for model-level issues | Useful for the "what actually happened" question |

When to choose LLM observability tools

Pick LLM observability when the thing you ship is a model surface. Customer support chat where the model answers from a knowledge base. Summarisation features inside a product. Classification pipelines. A copilot whose tool surface is small and stable. In these cases the variable that determines quality is the prompt and the model — not the orchestration around them. You want eval workflows, prompt versioning, A/B testing on system messages, and clean cost attribution per feature.

Also pick LLM observability when your team is small, your agent surface is shallow (one or two tools, called rarely), and you're earlier in the maturity curve. The data is easier to reason about. The tooling is more mature. The cost is lower. You'll outgrow it when your agent starts taking real actions on real systems, but that's a problem for later.

When to choose agent tracing platforms

Pick agent tracing the moment your agent starts writing to production systems on a user's behalf. The questions you'll need to answer in incidents — who did what, when, with what arguments, and what changed — are tool-call questions, not prompt questions. A trace that ends at the model's tool-use block is useless when the support ticket says "the agent refunded me twice."

Pick it when you're running multi-step workflows, sub-agents, or any orchestration where one decision triggers a chain of tool calls. Pick it when you have a real auth model (per-user OAuth, scoped permissions) and you need an audit trail that proves the agent acted as the right user. Pick it when retries, idempotency, and partial failures matter, which is to say: pick it for any agent project past the demo stage. This is also the layer where most agent projects actually stall, as covered in "Orchestrator vs tools layer: where agent work actually happens".
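Why retries and idempotency belong in the trace can be shown with a small simulation. The sketch below assumes a hypothetical `FlakyRefundAPI` that applies the write and then loses the first response, which is the classic route to a duplicate refund. The class, the `idempotency_key` parameter, and `call_with_retries` are invented for illustration and do not describe a specific billing API.

```python
import uuid


class FlakyRefundAPI:
    """Hypothetical billing API: the write succeeds, but the first reply is lost."""

    def __init__(self):
        self.refunds = {}  # idempotency_key -> amount
        self.calls = 0

    def refund(self, order_id, amount, idempotency_key):
        self.calls += 1
        # A repeated key is treated as a replay of the same request, not a new refund.
        self.refunds.setdefault(idempotency_key, amount)
        if self.calls == 1:
            raise TimeoutError("response lost after the write succeeded")
        return {"order_id": order_id, "refunded": self.refunds[idempotency_key]}


def call_with_retries(api, order_id, amount, max_attempts=3):
    """Retry on timeout, recording every attempt so the trace shows what happened."""
    key = str(uuid.uuid4())  # one key per logical refund, reused on every retry
    attempts = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = api.refund(order_id, amount, idempotency_key=key)
            attempts.append({"attempt": attempt, "key": key, "status": "ok"})
            return result, attempts
        except TimeoutError as exc:
            attempts.append({"attempt": attempt, "key": key,
                             "status": "error", "error": str(exc)})
    raise RuntimeError(f"refund failed after {max_attempts} attempts")


api = FlakyRefundAPI()
result, attempts = call_with_retries(api, "A-100", 25.0)
for entry in attempts:
    print(entry)
print("total refunded:", sum(api.refunds.values()))
```

Without the shared key, the second attempt would have created a second refund. Without the per-attempt records, the trace would show a single successful call and hide the retry entirely.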

How Pontil fits

Observability for agents only works if the tools layer itself emits the right signal. A trace that says "tool call succeeded" without recording which user identity ran it, what the underlying API actually returned, or whether a retry caused a duplicate write does not help in an incident.

Pontil sits in the tools layer of the agent stack. We generate connectors from the APIs you already own and run them through a managed runtime, where each tool call executes as the authenticated user — not a shared service account. Every invocation produces a structured span with identity, arguments, result, and timing, ready for export to whichever agent tracing platform you've chosen. The point isn't to replace your observability vendor. It's to make sure the data they're recording reflects what actually happened in your product. You can see how the tools layer fits the rest of the stack on the Pontil product page.
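As a rough picture of what a structured span with identity, arguments, result, and timing could look like on export, here is a generic Python sketch. It is not Pontil's schema or API: `ToolSpan`, `run_tool`, the field names, and the `lookup_order` tool are all hypothetical, and the JSON-line output is just one plausible export format.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class ToolSpan:
    tool: str
    user_id: str
    arguments: dict
    result: Any
    error: Optional[str]
    started_at: float   # wall-clock start, seconds since the epoch
    duration_ms: float


def run_tool(tool_name, func, user_id, **arguments):
    """Execute a tool and return a span describing the call, success or failure."""
    started_at = time.time()
    t0 = time.perf_counter()
    result, error = None, None
    try:
        result = func(**arguments)
    except Exception as exc:
        # Captured for the sketch; production code would usually re-raise as well.
        error = f"{type(exc).__name__}: {exc}"
    duration_ms = (time.perf_counter() - t0) * 1000
    return ToolSpan(tool_name, user_id, arguments, result, error, started_at, duration_ms)


def lookup_order(order_id):
    """Hypothetical tool standing in for a call to a real API."""
    return {"order_id": order_id, "status": "shipped"}


tool_span = run_tool("lookup_order", lookup_order, user_id="user-42", order_id="A-100")
line = json.dumps(asdict(tool_span), sort_keys=True)
print(line)
```

The point of the sketch is the field list, not the transport: whichever tracing platform receives the span, it can only answer incident questions about data the tool runtime actually recorded.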

What we'd choose

Most serious agent projects end up running both. LLM observability for the prompt-and-eval loop during development. Agent tracing for production runs, incidents, and audit. The split is real because the questions are different: "is this prompt good" and "what did the agent actually do" don't share a data model — even when one vendor sells you tooling for both.

If you're forcing a single choice today, the variable is whether your agent takes meaningful actions on production systems. If yes, agent tracing first — you cannot debug what you cannot see, and an LLM observability tool will not show you the tool side at the depth you need. If no, LLM observability is enough, and you can defer the tracing decision until your tool surface grows. The mistake to avoid is the inverse: picking LLM observability because it's familiar, then trying to retrofit tool-call visibility through tags and custom metadata. That path produces traces that look complete and aren't, which is worse than no traces at all.
