Tool calling vs function calling: the same mechanism, two production realities

Where the API contracts diverge, how reliability breaks, and how to evaluate.


The terms tool calling and function calling get used interchangeably in foundation model documentation, vendor marketing, and engineering Slack channels. They describe the same underlying mechanism: a model emits a structured request, an external system executes it, and the result feeds back into the model's context. But the framing matters. Function calling is what OpenAI shipped in 2023 — a narrow contract between a model and a developer-defined function. Tool calling is the broader operational pattern that emerged once those functions started representing real product capabilities, executed across production systems, on behalf of authenticated users.

This piece argues that the gap between the two terms is the gap between a demo and a deployment. The mechanism is the same. The reliability requirements are not. Below: what each term actually means, why the API contracts diverged, where production reliability breaks, and what an honest evaluation framework looks like for teams past the prototype stage.

Where the terms came from, and why both are still in use

OpenAI introduced function calling in June 2023. The original API let developers describe a function with a JSON schema, pass that schema to the model, and receive a structured invocation back. The model didn't run the function — it suggested one. The developer's code did the work and returned the result. That contract was clean, narrow, and developer-facing.

By late 2023, OpenAI had renamed the field to tools in its Chat Completions API (announced at DevDay, November 2023, alongside parallel tool calls), with the same concept carried forward into the Responses API. Anthropic shipped tool use in beta in November 2023 (with Claude 2.1) and reached GA in May 2024, using the name tools from the start. Google's Gemini followed. The rename wasn't cosmetic. By the time agents were doing more than answering questions (reading calendars, posting messages, updating CRM records), calling those capabilities "functions" understated what was actually happening. A function returns a value. A tool reaches out and changes something in a system you don't fully control.

Both terms persist because both describe true things. Function calling is the primitive: a model emits a structured call, code executes it. Tool calling is the pattern: that primitive applied to capabilities in production systems with auth, side effects, latency, and failure modes. Engineers who say function calling usually mean the API surface. Engineers who say tool calling usually mean the operational reality.

What's actually the same underneath

Strip the marketing and the mechanism is identical across providers. The model receives a list of available capabilities, each described with a name, a description, and a JSON schema for parameters. During generation, the model can produce a structured output that names one of those capabilities and supplies arguments matching the schema. Your code parses the output, runs whatever the capability points to, and returns a result string. The model reads that result and continues.

Three things are true of every implementation:

  • The model never executes anything. It only proposes calls.
  • The schema is the contract. Bad schemas produce bad calls.
  • Results return as strings the model has to re-parse, even if your underlying system returned structured data.
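The loop described above can be sketched in a few lines. This is a minimal illustration rather than any provider's SDK: the tool name get_weather, the REGISTRY dict, and the shape of the proposed call are all assumptions made for the example, and the model's output is simulated with a plain dict.

```python
import json

# Hypothetical tool definition: a name, a description, and a JSON schema.
GET_WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up current weather for a city. Use when the user asks about weather.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> dict:
    # Stand-in for a real lookup; returns structured data.
    return {"city": city, "temp_c": 18, "conditions": "cloudy"}

# Maps tool names to the code that actually runs. The model never sees this.
REGISTRY = {"get_weather": get_weather}

def execute_tool_call(call: dict) -> str:
    """Run a call the model proposed and return the result as a string."""
    func = REGISTRY.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    try:
        args = json.loads(call["arguments"])  # arguments arrive as a JSON string here
        result = func(**args)
    except (json.JSONDecodeError, TypeError) as exc:
        return json.dumps({"error": f"bad arguments: {exc}"})
    # Structured data is serialized back to text for the model to read.
    return json.dumps(result)

# Simulated model output: the model only proposes the call.
proposed = {"name": "get_weather", "arguments": '{"city": "Lisbon"}'}
result_text = execute_tool_call(proposed)
```

Note that every failure path also returns a string. The model can only reason about what comes back in its context, so an unknown tool or a malformed argument is reported as text rather than raised.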

This is why MCP (the Model Context Protocol) didn't replace function calling. It standardised the transport between a model client and an external tool server, but the model-side primitive stayed the same. A tool exposed over MCP still arrives at the model as a name, a description, and a schema. We've covered this distinction in detail in "What is MCP in AI?", which is worth reading if MCP framing is muddying your team's planning.
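To make the "same primitive" point concrete, here is a rough sketch that maps a tool listing as an MCP server might return it onto two provider-side tool definitions. The field names (inputSchema for MCP, function/parameters for Chat Completions, input_schema for Anthropic) reflect our reading of the public docs at the time of writing and should be checked against current references. The search_tickets tool is invented for the example.

```python
# A tool as an MCP server might list it (hypothetical example tool).
mcp_tool = {
    "name": "search_tickets",
    "description": "Search support tickets by keyword. Use for questions about past tickets.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def to_chat_completions_tool(tool: dict) -> dict:
    """Map an MCP-style tool listing to a Chat Completions-style definition."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["inputSchema"],
        },
    }

def to_anthropic_tool(tool: dict) -> dict:
    """Map an MCP-style tool listing to an Anthropic-style definition."""
    return {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "input_schema": tool["inputSchema"],
    }

openai_style = to_chat_completions_tool(mcp_tool)
anthropic_style = to_anthropic_tool(mcp_tool)
```

In each case the conversion is a relabelling. The name, the description, and the JSON schema pass through untouched, which is the sense in which the model-side primitive did not change.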

Where the API contracts actually diverge

The mechanism is the same. The contracts aren't. Three differences matter when you're building beyond a prototype.

| Dimension | Function calling (early framing) | Tool calling (current framing) |
| --- | --- | --- |
| Scope | One function, one call per turn | Multiple tools, parallel calls per turn |
| Execution model | Synchronous, in-process | Often async, across services, with auth |
| Failure surface | Schema errors, bad arguments | Schema errors, auth expiry, rate limits, partial failure, drift |
| Identity | Usually the developer's API key | The end user's identity, scoped per call |
| Maintenance | Function signature changes on deploy | Tool contracts drift as products change |

The parallel calls shift matters more than it sounds. OpenAI shipped parallel tool calls at DevDay in November 2023, alongside the tools rename. Anthropic followed. Once a model can request three tools in one turn, your execution layer has to handle partial failure: tool A succeeds, tool B times out, tool C returns an auth error. The model needs all three results back in a structured form it can reason about. "Function calling" framing — one call, one result, one next turn — doesn't cover this.
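One way to handle the three-tool scenario above is to give every call its own timeout and error handling, so a single failure never takes down the turn. The sketch below uses asyncio from the standard library. The three tools, the call IDs, and the error_type labels are hypothetical stand-ins, not part of any provider API.

```python
import asyncio

async def tool_a():
    return {"events": 3}

async def tool_b():
    await asyncio.sleep(1.0)  # simulates a slow upstream service
    return {"never": "returned"}

async def tool_c():
    raise PermissionError("token expired")

async def run_one(call_id, name, fn, timeout_s):
    """Run one tool call and always return a structured result, never raise."""
    base = {"call_id": call_id, "tool": name}
    try:
        value = await asyncio.wait_for(fn(), timeout=timeout_s)
        return {**base, "status": "ok", "result": value}
    except asyncio.TimeoutError:
        return {**base, "status": "error", "error_type": "timeout",
                "message": f"{name} did not respond within {timeout_s}s"}
    except PermissionError as exc:
        return {**base, "status": "error", "error_type": "auth", "message": str(exc)}
    except Exception as exc:  # anything else is reported, not propagated
        return {**base, "status": "error", "error_type": "unexpected", "message": str(exc)}

async def run_parallel(calls, timeout_s=0.05):
    # gather preserves input order, so results line up with the model's calls.
    return await asyncio.gather(
        *(run_one(call_id, name, fn, timeout_s) for call_id, name, fn in calls)
    )

CALLS = [("call_1", "tool_a", tool_a), ("call_2", "tool_b", tool_b), ("call_3", "tool_c", tool_c)]
results = asyncio.run(run_parallel(CALLS))
```

Each entry in results carries the call ID it answers, so the model gets one success and two distinct, labelled errors back in the same turn.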

Identity is the other divergence that catches teams late. Early function calling assumed the developer's credentials. Tool calling in production has to execute as the authenticated user — their permissions, their data visibility, their audit trail. A shared service account is a security and compliance failure waiting to happen. The runtime layer has to handle token refresh, scope enforcement, and per-call identity propagation. None of that is in the model-side API.
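A minimal sketch of per-call identity propagation follows, under the assumption that each tool maps to exactly one required scope. UserContext, REQUIRED_SCOPE, the scope strings, and the in-memory AUDIT_LOG are all illustrative names. A real runtime would back these with an identity provider and durable audit storage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserContext:
    user_id: str
    scopes: frozenset

# Hypothetical mapping from tool name to the scope it requires.
REQUIRED_SCOPE = {
    "calendar.list_events": "calendar:read",
    "crm.update_record": "crm:write",
}

AUDIT_LOG = []  # in-memory stand-in for a real audit trail

def list_events(user: UserContext, day: str) -> list:
    # The handler receives the user, so data access is scoped to them.
    return [f"{user.user_id}: standup on {day}"]

HANDLERS = {"calendar.list_events": list_events}

def execute_as_user(tool_name: str, args: dict, user: UserContext) -> dict:
    """Check scope, record the attempt, then run the tool as this user."""
    required = REQUIRED_SCOPE.get(tool_name)
    allowed = required is not None and required in user.scopes
    AUDIT_LOG.append({"user_id": user.user_id, "tool": tool_name, "allowed": allowed})
    if not allowed:
        return {"status": "error", "error_type": "forbidden",
                "message": f"{user.user_id} lacks the scope required for {tool_name}"}
    return {"status": "ok", "result": HANDLERS[tool_name](user, **args)}

alice = UserContext("alice", frozenset({"calendar:read"}))
ok = execute_as_user("calendar.list_events", {"day": "Monday"}, alice)
denied = execute_as_user("crm.update_record", {"record_id": "42"}, alice)
```

The point of the design is that the user travels with the call. Scope checks and audit entries happen at execution time, per call, instead of being implied by whichever credential the service happens to hold.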

Why production reliability is the actual problem for tool calling

In a demo, tool calling works almost every time. The schema is hand-written. The tool does one thing. The auth is your own. In production with real users and real products, the failure modes compound.

Schema drift. Your product's API changes. The tool description doesn't. The model keeps emitting calls against the old schema. Calls fail silently or — worse — succeed with wrong arguments. This is the failure mode bespoke connectors generate at scale, and it's the reason connector maintenance cost becomes the dominant line item in any portfolio-scale agent project.
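Drift of this kind can often be caught mechanically by diffing the tool's parameter schema against the API's current one. The sketch below assumes both are available as JSON-schema-style dicts and only compares top-level properties, types, and required fields. The create_task parameters are invented for the example.

```python
def detect_drift(tool_schema: dict, api_schema: dict) -> list:
    """Return human-readable differences between a tool schema and the live API schema."""
    findings = []
    tool_props = tool_schema.get("properties", {})
    api_props = api_schema.get("properties", {})

    # Parameters the tool still advertises but the API no longer accepts.
    for name in sorted(set(tool_props) - set(api_props)):
        findings.append(f"removed from API: {name}")
    # Parameters the API accepts that the tool never tells the model about.
    for name in sorted(set(api_props) - set(tool_props)):
        findings.append(f"missing from tool: {name}")
    # Parameters present on both sides whose declared type changed.
    for name in sorted(set(tool_props) & set(api_props)):
        if tool_props[name].get("type") != api_props[name].get("type"):
            findings.append(f"type changed: {name}")
    # Parameters the API now requires but the tool treats as optional or unknown.
    newly_required = set(api_schema.get("required", [])) - set(tool_schema.get("required", []))
    for name in sorted(newly_required):
        findings.append(f"now required: {name}")
    return findings

# Hypothetical create_task tool, written against an older API version.
tool_schema = {
    "properties": {"title": {"type": "string"}, "due": {"type": "string"}},
    "required": ["title"],
}
# The API as it looks after a release renamed "due" and added "priority".
api_schema = {
    "properties": {
        "title": {"type": "string"},
        "due_date": {"type": "string"},
        "priority": {"type": "integer"},
    },
    "required": ["title", "due_date"],
}

findings = detect_drift(tool_schema, api_schema)
```

Run in CI against the published API spec, a check like this turns silent drift into a failing build. It does not fix the tool description, which still needs updating so the model knows what changed.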

Tool description quality. Models pick tools by reading their descriptions. Two tools with overlapping descriptions confuse the model. A description that names parameters but doesn't explain when to use the tool gets ignored. Most teams discover this after their agent starts picking the wrong tool a meaningful fraction of the time in production, a failure mode well documented in tool-selection benchmarks such as the Berkeley Function-Calling Leaderboard.

Result formatting. Tools return strings. If your tool returns a 4KB JSON blob, the model spends tokens parsing it. If it returns a 400KB blob, you've likely blown the context window. Production-grade tools shape their results rather than dumping raw API responses.
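As a rough illustration of result shaping, the function below keeps only the fields the model needs, caps the number of items, and enforces a character budget on the serialized output. The field names and the budget values are arbitrary assumptions for the example. Sensible limits depend on the model and the task.

```python
import json

def shape_result(items: list, fields: list, max_items: int = 5, max_chars: int = 1500) -> str:
    """Reduce a raw list of API records to a compact, model-ready string."""
    shaped = [{k: item[k] for k in fields if k in item} for item in items[:max_items]]
    payload = {"total": len(items), "shown": len(shaped), "items": shaped}
    text = json.dumps(payload, separators=(",", ":"))
    # If the payload is still over budget, drop trailing items until it fits.
    while len(text) > max_chars and shaped:
        shaped.pop()
        payload["shown"] = len(shaped)
        text = json.dumps(payload, separators=(",", ":"))
    return text

# Simulated raw API response: 200 records, each with a large body field.
raw_items = [
    {"id": i, "title": f"Ticket {i}", "body": "x" * 2000, "internal_flags": [1, 2, 3]}
    for i in range(200)
]

raw_size = len(json.dumps(raw_items))
shaped_text = shape_result(raw_items, fields=["id", "title"])
```

Reporting total alongside shown matters. It tells the model that more records exist, so it can ask for a narrower query instead of assuming it has seen everything.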

Auth lifecycle. OAuth tokens expire. Refresh tokens rotate. The model has no concept of any of this. The runtime has to catch a 401, refresh transparently, retry, and return a clean result — or, if refresh fails, return an error the model can actually reason about ("the user needs to reconnect their account") rather than a stack trace.
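The catch, refresh, retry sequence can be sketched as a small wrapper. The exception classes, the call_api and refresh_token callables, and the reauth_required label below are hypothetical. In a real runtime they would map onto your HTTP client's 401 handling and your OAuth library's refresh flow.

```python
class AuthExpired(Exception):
    """Raised by the API client when the access token is rejected (a 401)."""

class RefreshFailed(Exception):
    """Raised when the refresh token is no longer valid."""

REAUTH_ERROR = {
    "status": "error",
    "error_type": "reauth_required",
    "message": "The user needs to reconnect their account.",
}

def call_with_refresh(call_api, refresh_token, token: str) -> dict:
    """Call the API, refreshing the token once on expiry. Never raises auth errors."""
    try:
        return {"status": "ok", "result": call_api(token)}
    except AuthExpired:
        pass  # fall through to the refresh path
    try:
        new_token = refresh_token()
    except RefreshFailed:
        return dict(REAUTH_ERROR)
    try:
        return {"status": "ok", "result": call_api(new_token)}
    except AuthExpired:
        # A fresh token was rejected too, so only the user can fix this.
        return dict(REAUTH_ERROR)

# Fake API for the example: only the token "fresh" is accepted.
def fake_api(token: str) -> str:
    if token != "fresh":
        raise AuthExpired()
    return "3 unread messages"

def working_refresh() -> str:
    return "fresh"

def broken_refresh() -> str:
    raise RefreshFailed()

recovered = call_with_refresh(fake_api, working_refresh, token="stale")
needs_user = call_with_refresh(fake_api, broken_refresh, token="stale")
```

In the recovered case the model never learns a refresh happened. In the failing case it gets a short message it can relay to the user, which is the outcome the paragraph above describes.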

Rate limits and backoff. Models will happily fire three parallel calls into the same rate-limited API. Without a runtime that holds the limit and queues correctly, you get cascading failures that look like model unreliability but are really infrastructure.
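A minimal sketch of a runtime that holds the limit is shown below, assuming one shared semaphore per upstream API and exponential backoff with jitter. RateLimited, the retry_after_s attribute, and the delay values are illustrative. A production limiter would usually also honour the API's own rate-limit headers.

```python
import asyncio
import random

class RateLimited(Exception):
    """Raised by the API client on a rate-limit response (for example, a 429)."""
    def __init__(self, retry_after_s: float = 0.0):
        super().__init__("rate limited")
        self.retry_after_s = retry_after_s

async def call_with_backoff(fn, limiter, max_attempts=4, base_delay_s=0.01):
    """Run fn under a shared concurrency limit, backing off when rate limited."""
    for attempt in range(max_attempts):
        async with limiter:  # queue behind other calls to the same API
            try:
                return {"status": "ok", "result": await fn()}
            except RateLimited as exc:
                # Wait at least as long as the API asked, growing per attempt.
                delay = max(exc.retry_after_s, base_delay_s * 2 ** attempt)
        if attempt < max_attempts - 1:
            # Sleep outside the limiter so other calls can proceed; add jitter.
            await asyncio.sleep(delay + random.uniform(0, base_delay_s))
    return {"status": "error", "error_type": "rate_limited",
            "message": "The service is busy. Try again shortly."}

async def demo():
    limiter = asyncio.Semaphore(1)  # one in-flight call to this API at a time
    state = {"calls": 0}

    async def flaky_api():
        # Fake upstream: rejects the first two requests, then recovers.
        state["calls"] += 1
        if state["calls"] <= 2:
            raise RateLimited(retry_after_s=0.01)
        return f"response {state['calls']}"

    # Three parallel calls from one model turn, all aimed at the same API.
    return await asyncio.gather(*(call_with_backoff(flaky_api, limiter) for _ in range(3)))

results = asyncio.run(demo())
```

The semaphore is what makes three parallel calls from the model behave like a queue at the API boundary. Without it, each call would retry independently and amplify the very limit it hit.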

None of this lives in the model-side API. It lives in whatever sits between the model and the product — what we've called the tools layer. In our experience, the model API is a small fraction of the production picture; the bulk of the engineering work lives in the layer underneath.

How to evaluate function calling agents for real workloads

If you're picking between providers or designing your own tool layer, the surface comparison (does it support parallel calls, what's the JSON schema dialect, how big is the context window) is the easy part. The harder evaluation is what happens when things go wrong.

A working checklist

  • Can your tool layer execute as the authenticated user? Per-call identity propagation, not a shared service account. This is non-negotiable for any tool that touches user data.
  • Does it handle partial failure across parallel calls? Three tools called in one turn, one fails — does the model get a structured error or does the whole turn die?
  • Does it catch auth expiry transparently? Token refresh should never bubble up to the model as a generic error.
  • How does it handle schema drift? When the underlying API changes, do your tool definitions update automatically, or does an engineer manually edit a registry?
  • Can you observe tool calls? Latency per tool, error rate per tool, which tools the model picked vs. which it should have picked. Without this, debugging is guesswork.
  • What's the result shape contract? Tools that return raw API responses will burn your context budget. Tools should return shaped, summarised, model-ready strings.

Most teams build the happy path first and discover the failure surface in production. The honest evaluation is the inverse: start with the failure modes, work back to the architecture you need.

How Pontil fits

The argument above lands on a single point: the mechanism is the same, but tool calling in production needs a layer the model API doesn't provide. That layer is what Pontil builds.

Pontil is a Tools-as-a-Service platform. We generate tools from the APIs and codebases an established SaaS company already owns, run them as the authenticated user, and keep them current as the underlying product changes. That covers the parts of the production reliability surface (auth lifecycle, per-call identity, rate limits, schema drift maintenance) that sit outside the function-calling primitive itself. The model still emits structured calls the same way it did in 2023. What changes is whether those calls reach a real product surface reliably, on behalf of the right user, when the product team ships a release tomorrow. For teams whose agent project is past the prototype and stalled on the access problem, that's the gap worth closing. Our breakdown of why agent projects stall covers the pattern in more detail.

What does the next year of tool calling look like?

The primitive will keep stabilising. Schema dialects will converge. Parallel calls and streaming results will become table stakes across providers. MCP or something like it will become the default transport between model clients and tool servers. The interesting question isn't whether tool calling gets better — it will — but where teams choose to invest the engineering hours that the model API doesn't cover.

The answer most teams arrive at, eventually, is that the model layer is commoditising and the tools layer is where the real work lives. The terminology will probably settle on "tool calling" as the umbrella term, with "function calling" reserved for the narrow API-surface meaning. But the more useful split isn't linguistic. It's between the slice of the problem the model providers solve and the much larger slice they leave to you. Pick your stack accordingly.
