Webhook reliability patterns: how retries, idempotency, dead letter queues, and explicit delivery guarantees combine into infrastructure agents can trust.

Webhooks look like the simplest part of an integration. A producer sends an HTTP POST when something happens. A consumer accepts it and does the work. Then the first network blip lands, the consumer 500s, the producer retries, and the same event gets processed twice. Or three times. Or the retry queue fills up and events get dropped on the floor without anyone noticing.
This article is about the patterns that turn webhooks from a fragile notification mechanism into a delivery channel agents can build on. The argument is straightforward: webhook reliability isn't a single feature you ship — it's a stack of four patterns that have to coexist. Retries with backoff. Idempotency on the consumer side. Dead letter queues for the events that exhaust their retry budget. And explicit delivery guarantees that everyone agrees on before code gets written.
We'll work through each pattern, the trade-offs that don't make it into vendor documentation, and what changes when the consumer of the webhook isn't a human-built integration but an agent.
Most webhook infrastructure was designed for a specific shape of consumer: an integration script someone wrote once, deployed to a long-running server, and forgot about. The script handles a few event types, writes to a database, and the failure mode if something goes wrong is that an engineer notices stale data the next morning.
Agents change that shape. An agent invoking a tool that depends on webhook-delivered state assumes the state is current. If an order-created webhook is still in the retry queue when the agent reads the order list, the agent reasons over a stale picture and acts on it. The failure isn't a stale dashboard — it's a wrong decision, propagated through whatever the agent does next. And because agents act on behalf of authenticated users, the wrong decision shows up in audit logs with a real human's name on it.
The second shift is volume and concurrency. Agent-driven workflows fan out. One user request can trigger ten tool calls in parallel, each of which produces webhooks that other agents subscribe to. Patterns that held up at one webhook per second start dropping events at fifty. We've covered the broader stall pattern in our piece on why agent projects stall; webhook reliability is one of the specific places it shows up.
The first thing to notice about webhook retry policies is how much they vary — even between products at the same vendor. Stripe retries with exponential backoff and jitter for up to three days. Shopify updated its policy in September 2024 to 8 retries over roughly 4 hours with exponential backoff and a 5-second per-attempt timeout. GitHub doesn't auto-retry at all — failed deliveries have to be manually redelivered within a 3-day window via the UI or REST API. Twilio varies by product: standard webhooks (Voice, SMS request URLs, Conversations) get a single retry after roughly 15 seconds with no exponential backoff, while Event Streams and a handful of other products use true exponential backoff.
The lesson isn't that one policy is right. It's that "the platform handles retries" is not a meaningful statement until you know which platform and which product. Consumers have to build for the policy they actually get, not the one they assume.
Three retry decisions matter more than the backoff curve itself:
What counts as a retryable failure? The conservative answer is: any 5xx response, any connection failure, any timeout. The dangerous answer is: any non-2xx. A 400 from a consumer means the payload is malformed, and retrying it will get you another 400, ten thousand times. A 401 means auth has expired and the consumer needs to refresh credentials, not receive the same event again. Treat 4xx as terminal unless you have a specific reason not to. The usual exceptions are 408 (request timeout) and 429 (too many requests), which describe a transient condition rather than a bad request.
How do you cap retries? Two numbers, not one. Cap the number of attempts (so a single event can't retry forever) and cap the total wall-clock duration (so an event delivered four days late isn't useful anyway). Whichever cap hits first triggers the dead letter path.
Do you retry in order, or in parallel? Most webhook systems retry per-event independently, which means event B can be delivered before event A if A is still backing off. If your consumer assumes ordered delivery, you have a bug waiting to happen. Either build the consumer to handle out-of-order events (the right answer) or use a delivery mechanism that preserves order at the cost of head-of-line blocking.
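In practice, production systems apply both caps from the second decision, attempt count and wall-clock duration, and send the event to the dead letter path as soon as either one is reached.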
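The retry decisions above can be sketched as a small policy object. This is a minimal illustration, not any vendor's implementation: the names `RetryPolicy` and `is_retryable` are hypothetical, the default numbers are placeholders, treating 408 and 429 as retryable is an assumption, and so is routing terminal 4xx responses to the dead letter path rather than dropping them. The backoff uses full jitter, which is one common variant among several.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Assumption: the only 4xx codes worth retrying are timeout and rate limiting.
RETRYABLE_4XX = {408, 429}


def is_retryable(status: Optional[int]) -> bool:
    """status is None when the connection failed or the request timed out."""
    if status is None:
        return True
    if 500 <= status <= 599:
        return True
    return status in RETRYABLE_4XX


@dataclass(frozen=True)
class RetryPolicy:
    # Illustrative defaults only; tune them to the contract you publish.
    max_attempts: int = 8
    max_age_seconds: float = 4 * 3600
    base_delay: float = 5.0
    max_delay: float = 1800.0

    def backoff(self, attempt: int, rng: random.Random) -> float:
        """Full-jitter exponential backoff for a 1-based attempt number."""
        ceiling = min(self.max_delay, self.base_delay * 2 ** (attempt - 1))
        return rng.uniform(0, ceiling)

    def next_action(self, status: Optional[int], attempts_made: int,
                    age_seconds: float) -> str:
        """Return 'done', 'retry', or 'dead_letter' for the latest attempt."""
        if status is not None and 200 <= status <= 299:
            return "done"
        if not is_retryable(status):
            # Assumption: terminal failures are parked for triage, not dropped.
            return "dead_letter"
        # Two caps, whichever hits first.
        if attempts_made >= self.max_attempts:
            return "dead_letter"
        if age_seconds >= self.max_age_seconds:
            return "dead_letter"
        return "retry"
```

A scheduler would call `next_action` after each attempt and, on `"retry"`, sleep for `backoff(attempts_made, rng)` seconds before trying again.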
Idempotency is the property that processing the same event twice produces the same result as processing it once. It's the single most important pattern in webhook reliability, and it's the one most producer documentation treats as the consumer's problem.
Which, technically, it is. But the producer has to give the consumer something to work with.
The minimum a webhook producer should send: a stable, unique event ID that doesn't change across retries. Stripe calls it the event ID. GitHub uses a delivery GUID in the X-GitHub-Delivery header. The naming doesn't matter — the contract does. Same logical event, same ID, no matter how many times it's redelivered. If your producer generates a new ID on retry, your consumers cannot deduplicate, and idempotency is impossible.
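On the producer side, the contract comes down to assigning the ID when the event is created rather than when a delivery is attempted. A minimal sketch, assuming a hypothetical `Event` type and hypothetical header names (`X-Event-Id`, `X-Delivery-Attempt`) that are not taken from any real vendor:

```python
import json
import uuid
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Event:
    type: str
    payload: dict
    # Assigned once, at creation. Retries reuse this value.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


def build_delivery(event: Event, attempt: int) -> Tuple[Dict[str, str], bytes]:
    """Build headers and body for one delivery attempt of an event."""
    headers = {
        "Content-Type": "application/json",
        "X-Event-Id": event.id,               # hypothetical header name
        "X-Delivery-Attempt": str(attempt),   # hypothetical header name
    }
    # The body is identical across attempts; only the attempt header changes.
    body = json.dumps(
        {"id": event.id, "type": event.type, "data": event.payload},
        sort_keys=True,
    ).encode("utf-8")
    return headers, body
```

Keeping the attempt counter out of the body means the consumer can also hash the body for a secondary duplicate check if it wants one.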
The consumer side then needs three things:
A store of recently-seen event IDs, with a TTL longer than the producer's maximum retry window. If the producer retries for three days, the consumer's dedup store needs to remember IDs for at least three days plus a safety margin.
Atomic check-and-set semantics. Read-then-write doesn't cut it under concurrent retries — two workers can both check, both find the ID absent, both process the event. Use a database unique constraint, a Redis SETNX, or whatever your stack gives you that's actually atomic.
Idempotent side effects. Even with dedup, the work the consumer does has to be safe to repeat. "Create a record if it doesn't exist" is idempotent. "Increment a counter" isn't. Where the underlying operation isn't naturally idempotent, wrap it in an outbox pattern: write the intended effect to a local table inside the dedup transaction, then process the outbox asynchronously.
The failure mode to watch for is the one where the consumer's dedup store and its side-effect store diverge. The consumer accepts an event, writes to the dedup table, and crashes before doing the work. On retry, the dedup table says "already processed" and the work never gets done. The fix is to make the dedup write and the work atomic: run both in the same transaction against the same database, or use a transactional outbox so the work is durably scheduled in the same commit as the dedup record.
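The consumer-side requirements can be sketched with SQLite from the Python standard library. This is an illustration under stated assumptions: the table names, the `handle_event` signature, and the `fail_after_dedup` switch (which simulates a crash) are all hypothetical. The primary key on `event_id` provides the atomic check-and-set, and wrapping the dedup insert and the non-idempotent increment in one transaction keeps the two from diverging.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seen_events ("
        "event_id TEXT PRIMARY KEY, seen_at REAL NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE balances ("
        "account TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
    )
    return conn


def handle_event(conn: sqlite3.Connection, event_id: str, account: str,
                 delta: int, now: float, fail_after_dedup: bool = False) -> bool:
    """Apply the event once. Returns False if it was already processed."""
    try:
        with conn:  # one transaction: commit on success, rollback on error
            # Atomic check-and-set: the primary key rejects a second insert.
            conn.execute(
                "INSERT INTO seen_events (event_id, seen_at) VALUES (?, ?)",
                (event_id, now),
            )
            if fail_after_dedup:
                raise RuntimeError("simulated crash after the dedup write")
            # The increment is not idempotent by itself, so it must share
            # the transaction with the dedup record.
            conn.execute(
                "INSERT OR IGNORE INTO balances (account, amount) VALUES (?, 0)",
                (account,),
            )
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (delta, account),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery; nothing was changed


def purge_seen(conn: sqlite3.Connection, now: float, ttl_seconds: float) -> int:
    """Drop dedup records older than the TTL. Returns the number removed."""
    with conn:
        cur = conn.execute(
            "DELETE FROM seen_events WHERE seen_at < ?", (now - ttl_seconds,)
        )
    return cur.rowcount
```

If the handler fails after the dedup insert, the rollback removes the dedup record too, so the producer's next retry is processed normally instead of being discarded as a duplicate.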
A dead letter queue (DLQ) is the destination for events that exhausted their retry budget without succeeding. The naive view is that the DLQ is a graveyard — a place to dump failures and move on. The useful view is that the DLQ is a queue, with the same operational requirements as the live one: monitored, alertable, drainable, replayable.
Three design decisions shape whether a DLQ does its job:
Granularity. One DLQ per webhook endpoint? Per event type? Per consumer? The trade-off is between operational simplicity (one DLQ to watch) and triage speed (a payment-event DLQ tells you immediately what's broken). For most teams, per-endpoint is the right starting point, with the option to split by event type if one type dominates failures.
Retention. How long does an event stay in the DLQ before it's purged? Long enough to triage during a weekend on-call rotation, short enough that the queue doesn't grow without bound. Seven to thirty days is typical. Tie retention to your incident response SLA, not to arbitrary cost optimisation.
Replay. A DLQ is useless if events can't be re-enqueued after the bug is fixed. The replay path needs to honour idempotency (so a replayed event doesn't double-process if the consumer partially succeeded the first time) and needs to be operator-driven, not automatic. Auto-replay sounds appealing until you replay a thousand events into a still-broken consumer and fill the DLQ again.
The single best signal a webhook system is mature: DLQ depth is a top-line metric on the operations dashboard, alongside delivery latency and success rate. The single best signal it isn't: nobody knows what's in the DLQ right now.
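A DLQ that behaves like a queue rather than a graveyard needs only a few operations. The sketch below is an in-memory stand-in with hypothetical names (`DeadLetterQueue`, `DeadLetter`, the `deliver` callback); a real one would sit on a durable store. It exposes depth for dashboards, retention-based purging, and a replay that runs only on the IDs an operator passes in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class DeadLetter:
    event_id: str
    endpoint: str
    payload: dict
    last_error: str
    dead_at: float  # when the retry budget was exhausted


class DeadLetterQueue:
    def __init__(self, retention_seconds: float = 14 * 24 * 3600):
        # Illustrative default inside the seven-to-thirty-day range.
        self.retention_seconds = retention_seconds
        self._items: Dict[str, DeadLetter] = {}

    def put(self, letter: DeadLetter) -> None:
        # Keyed by event ID, so parking the same event twice is harmless.
        self._items[letter.event_id] = letter

    def depth(self, endpoint: Optional[str] = None) -> int:
        """Top-line metric; optionally filtered to one endpoint."""
        if endpoint is None:
            return len(self._items)
        return sum(1 for item in self._items.values()
                   if item.endpoint == endpoint)

    def purge_expired(self, now: float) -> int:
        expired = [event_id for event_id, item in self._items.items()
                   if now - item.dead_at > self.retention_seconds]
        for event_id in expired:
            del self._items[event_id]
        return len(expired)

    def replay(self, event_ids: Iterable[str],
               deliver: Callable[[DeadLetter], bool]) -> List[str]:
        """Operator-driven replay of the named events only.

        An event leaves the queue only when deliver() reports success, so
        replaying into a still-broken consumer loses nothing.
        """
        replayed = []
        for event_id in event_ids:
            letter = self._items.get(event_id)
            if letter is not None and deliver(letter):
                del self._items[event_id]
                replayed.append(event_id)
        return replayed
```

Because `replay` delivers the original event ID, the consumer's dedup store still decides whether the work runs, which is what keeps a replay from double-processing a partially handled event.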
Every webhook system makes one of three delivery guarantees. The problem is that most teams never explicitly choose, and producers and consumers end up with mismatched assumptions.
At-most-once: each event is delivered zero or one times. Simple to implement (no retries), useless for state synchronisation, fine for fire-and-forget notifications where loss is acceptable. This is effectively what GitHub gives you by default — no automatic retries, manual redelivery only.
At-least-once: each event is delivered one or more times. The default for almost every production webhook system that retries automatically. Requires idempotency on the consumer side to be safe. This is what Stripe and Shopify do, and what Twilio's Event Streams product does, even when the documentation doesn't say so plainly.
Exactly-once: each event is delivered exactly one time. Marketing-friendly, operationally impossible across a network boundary without consumer cooperation. What vendors who claim exactly-once actually mean is at-least-once delivery plus producer-assigned event IDs that let the consumer collapse duplicates. The exactly-once property is jointly produced by both sides, not unilaterally by the producer.
Write the guarantee into the API contract. "We deliver each event at least once. Event IDs are stable across retries. Consumers must deduplicate." That sentence, in your developer documentation, prevents more bugs than any amount of retry tuning.
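One way to keep that sentence honest is to carry it as data that both sides can read. The sketch below is hypothetical throughout (`DeliveryContract`, `Guarantee`, and the 12-hour default margin are illustrative names and numbers): it records the guarantee, flags a contract that retries without stable IDs, and derives the minimum dedup TTL a consumer needs from the producer's retry window.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import List


class Guarantee(Enum):
    AT_MOST_ONCE = "at-most-once"
    AT_LEAST_ONCE = "at-least-once"


@dataclass(frozen=True)
class DeliveryContract:
    guarantee: Guarantee
    stable_event_ids: bool
    max_retry_window: timedelta

    def consumer_must_deduplicate(self) -> bool:
        # Duplicates are only possible when the producer redelivers.
        return self.guarantee is Guarantee.AT_LEAST_ONCE

    def problems(self) -> List[str]:
        """Return contract inconsistencies; an empty list means none found."""
        found = []
        if self.consumer_must_deduplicate() and not self.stable_event_ids:
            found.append("retries without stable event IDs: "
                         "consumers cannot deduplicate")
        return found

    def min_dedup_ttl(self,
                      safety_margin: timedelta = timedelta(hours=12)) -> timedelta:
        """Dedup records must outlive the longest possible redelivery."""
        return self.max_retry_window + safety_margin
```

For a producer that retries for three days, `min_dedup_ttl()` with the default margin gives three and a half days, which is the number the consumer's dedup store should be configured against.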
For more on why explicit contracts matter once agents enter the picture, see our article "API products are not the same as agent-ready products."
The four patterns above — retries, idempotency, DLQs, explicit guarantees — are the standard webhook reliability stack. They were designed for human-built consumers. Agents stress them in specific ways.
Agents are short-lived. A traditional consumer is a long-running service with a database and a worker pool. An agent runs for the duration of one user request, makes a tool call, gets a response, and dies. There's no persistent dedup store inside the agent, no DLQ the agent owns, no retry loop the agent runs. All of that has to live in the tools layer the agent invokes — the runtime that holds the webhook subscription, owns the idempotency store, and presents the agent with the current state when it asks.
Agents read state, they don't subscribe to events. An agent invoking list_orders expects the result to reflect every order that exists, including ones whose creation webhooks are still in the retry queue. The webhook layer has to converge fast enough that the read-after-write window doesn't break the agent's reasoning. In practice, that means tighter retry caps for state-sync events than for notification events, and a synchronous fallback (read from source of truth) when freshness matters.
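The synchronous fallback can be sketched as a read path that checks freshness before trusting the webhook-fed copy. Everything here is an assumption for illustration: the `Projection` type, the `pending_events` counter (which presumes the tools layer can see its own retry backlog), the five-second staleness budget, and the `fetch_from_source` callback standing in for a direct API read.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Projection:
    """Local copy of a resource, kept current by webhook deliveries."""
    orders: List[dict]
    synced_at: float      # time the last event was applied
    pending_events: int   # events for this resource still being retried


def list_orders(projection: Projection,
                fetch_from_source: Callable[[], List[dict]],
                now: float,
                max_staleness_seconds: float = 5.0) -> Tuple[List[dict], str]:
    """Return orders plus which path served them: 'projection' or 'source'."""
    too_old = (now - projection.synced_at) > max_staleness_seconds
    has_backlog = projection.pending_events > 0
    if too_old or has_backlog:
        # Freshness matters more than latency here: go to the source of truth.
        return fetch_from_source(), "source"
    return projection.orders, "projection"
```

Returning the path alongside the data lets the tools layer log how often agents hit the fallback, which is a direct measure of how well the webhook layer is converging.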
Agents act as the authenticated user. A webhook handler that processes events under a shared service account loses the audit trail the moment the agent is involved. The handler needs to either preserve the original user context across the webhook delivery (signed payload with user identity) or defer the user-scoped work to a downstream tool call that runs under the right identity.
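The first option, a signed payload carrying user identity, can be sketched with an HMAC over the raw body. This is a minimal illustration: the `actor_user_id` field name is hypothetical, the shared secret would come from configuration, and real schemes usually also sign a timestamp to limit replay, which is omitted here.

```python
import hashlib
import hmac
import json
from typing import Optional


def sign(secret: bytes, body: bytes) -> str:
    """Producer side: hex HMAC-SHA256 over the exact bytes being sent."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_and_extract_user(secret: bytes, body: bytes,
                            signature: str) -> Optional[str]:
    """Consumer side: return the acting user ID, or None if unverified."""
    expected = sign(secret, body)
    # Constant-time comparison to avoid leaking how many characters matched.
    if not hmac.compare_digest(expected.encode("utf-8"),
                               signature.encode("utf-8")):
        return None
    # Only trust the identity after the signature has been checked.
    return json.loads(body).get("actor_user_id")  # hypothetical field name
```

The handler can then run user-scoped work under the returned identity, or refuse it when the function returns `None`, instead of falling back to a shared service account.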
The webhook reliability patterns we've described are well-understood. The work is in implementing them consistently across every product an established SaaS company already owns — most of which were built before agents were a consideration.
That's the gap Pontil closes. We sit in the tools layer of the agent stack, generating and maintaining connectors from the APIs that already exist in your codebase. Pontil's execution engine handles the retry, backoff, state, and error handling for tool and connector invocations — which is the same resilience surface this article has been describing, consolidated in one layer instead of scattered across product code. Tool calls execute as the authenticated user, so the audit trail survives the round trip.
If the bottleneck on your agent project is that webhook reliability has to be solved separately for every product in the portfolio, that's the situation Tools-as-a-Service is designed for.
The honest answer is that the patterns haven't changed much in a decade — exponential backoff, idempotent consumers, dead letter queues, explicit delivery guarantees. What's changed is the cost of getting them wrong. A webhook bug used to mean a stale dashboard. Now it means an agent acts on incomplete state and the wrong thing happens with a real user's name attached.
The teams that ship reliable webhook delivery in 2026 are the ones who treat the four patterns as a single coupled system, not four independent features. Retries without idempotency cause duplicates. Idempotency without a DLQ hides systematic failures. A DLQ without an explicit delivery guarantee leaves consumers guessing about what they're supposed to handle. All four, or none of the four — partial implementations are worse than no implementation at all, because they create the illusion of reliability without the substance.
For agent-era consumers, add a fifth requirement: the webhook layer has to converge fast enough, and present state coherently enough, that an agent reading the surface immediately after an event fires sees the world the event described. That's the standard the next decade of webhook infrastructure has to clear.