Why our retry budget was actually a load amplifier

Heads up

This is a sanitized post-mortem. Numbers are rounded, services renamed, and the root cause has been deliberately isolated so it reads in one sitting.

The incident lasted four hours and thirteen minutes. The trigger was a one-second p99 spike in our orders service caused by a routine garbage collection pause on a single node. The amplifier was a retry helper I wrote myself two quarters ago, with the deliberate intent of making the system more resilient. It was, of course, anything but.

The shape of the helper

The original code was unremarkable. It looked like every retry helper you have ever read.

typescript

// retry.ts — v1, deployed Q4 2025
export async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseMs = 100,
): Promise<T> {
  let lastErr: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn()
    } catch (err) {
      lastErr = err
      const backoff = baseMs * 2 ** i
      await new Promise((r) => setTimeout(r, backoff))
    }
  }
  throw lastErr
}

Three attempts. Exponential backoff. Standard library fare. We applied it in roughly seventy call sites across the orders, billing, and inventory services, including, fatefully, on the synchronous read-path to a downstream pricing API.

The maths is the part everyone misses. If fn() succeeds 99.9% of the time under normal conditions, this helper adds essentially zero overhead. But the failure rate is the variable that matters. The moment fn() starts succeeding only 50% of the time — because, say, a single GC pause is making half the requests time out — the amplifier kicks in.

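A naive call generates one request. A failed call with this helper generates up to three. If each attempt fails with probability p, the expected number of requests per call is 1 + p + p², which is 1.75 at p = 0.5 and approaches 3 as p approaches 1. The extra load pushed the pricing API toward that limit, so across seventy call sites at peak traffic our request rate to it roughly tripled once it became sick. Three times the load on a service whose latency had already cliffed off because of the GC pause.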
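The amplification factor depends on the failure rate, and it is worth computing rather than guessing. The following is a minimal sketch of that arithmetic, assuming every attempt fails independently with the same probability (a simplification, since failures during a real incident are correlated). `expectedRequests` is a hypothetical helper for illustration, not part of our codebase.

```typescript
// Expected downstream requests per logical call for a helper that makes up to
// `attempts` tries. Assumption: each try fails independently with probability `p`.
export function expectedRequests(p: number, attempts = 3): number {
  let total = 0
  let reachProbability = 1 // probability that attempt k is made at all
  for (let k = 0; k < attempts; k++) {
    total += reachProbability
    reachProbability *= p // attempt k+1 only happens if attempt k failed
  }
  return total
}

// Print the amplification factor for a healthy, degraded, and dead downstream.
for (const p of [0.001, 0.5, 0.9, 1]) {
  console.log(`p=${p}: ${expectedRequests(p).toFixed(3)} requests per call`)
}
```

The factor stays close to 1 while the downstream is healthy and only reaches the full multiple of `attempts` when nearly every call fails, which is exactly when the downstream can least afford it.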

Drawing the load curve

I redrew this on a whiteboard the next morning, and it is the single most useful thing I have ever drawn for a junior engineer:

text

   Calls/s sent to downstream                            (amplification)
        ▲
   3000 │        ╭──────────────────╮
        │       │                   │
   2000 │      │       brown-out    │
        │     │      window         │
   1000 │ ───╯                       ╰─── ← normal traffic resumes
        │
        └───────────────────────────────────────▶ time
            ↑                       ↑
          GC pause             clients fail-fast,
          (1 s)                amplification ends

The downstream service was sick for one second. The brown-out lasted four hours because the retry helper kept it sick.

The actual fix: a token-bucket retry budget

The classic remedy is what Google's SRE book calls a retry budget: a token bucket that caps how many retries the system as a whole is allowed to issue per unit of time. When the bucket is empty, retries are converted into immediate failures. The bucket refills proportionally to successful traffic, so a healthy service can absorb a transient blip; an unhealthy one stops piling on.
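To make the "refills proportionally to successful traffic" idea concrete, here is a minimal sketch of a bucket that earns tokens from successes instead of from a clock. The class name `RetryBudget`, its methods, and the default of ten successes per retry are all assumptions chosen for illustration. It is not the helper we run in production.

```typescript
// Hypothetical sketch: a retry budget that is refilled by successful calls.
// Each success deposits one token; each retry costs `retryCost` tokens, so
// sustained retries are capped at roughly 1/retryCost of successful traffic.
export class RetryBudget {
  private tokens: number
  private readonly maxTokens: number
  private readonly retryCost: number

  constructor(maxTokens = 1000, retryCost = 10) {
    this.maxTokens = maxTokens
    this.retryCost = retryCost
    this.tokens = maxTokens // start full so a healthy service can absorb a blip
  }

  // Call after every successful request.
  recordSuccess(): void {
    this.tokens = Math.min(this.maxTokens, this.tokens + 1)
  }

  // Call before every retry. Returns false when the budget is exhausted,
  // in which case the caller should fail fast instead of retrying.
  trySpend(): boolean {
    if (this.tokens < this.retryCost) return false
    this.tokens -= this.retryCost
    return true
  }
}

// Usage: a tiny budget that can pay for two retries before running dry.
const budget = new RetryBudget(20, 10)
console.log(budget.trySpend(), budget.trySpend(), budget.trySpend())
```

Integer token counts are a deliberate choice here: depositing a fraction such as 0.1 per success accumulates floating-point error, and ten deposits would not quite add up to one retry.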

The key insight is that the budget is shared by every request on a channel, not allocated per request. Here is the version we now run in production. Note that it refills on wall-clock time rather than on successful calls:

typescript

// retry.ts — v2, ships with global budget per "channel"
type Channel = { tokens: number; maxTokens: number; refillPerSec: number; lastTick: number }
const channels = new Map<string, Channel>()

function getChannel(name: string, maxTokens = 100, refillPerSec = 10): Channel {
  let c = channels.get(name)
  if (!c) {
    c = { tokens: maxTokens, maxTokens, refillPerSec, lastTick: Date.now() }
    channels.set(name, c)
  }
  // Lazy refill — no background timer
  const now = Date.now()
  const elapsed = (now - c.lastTick) / 1000
  c.tokens = Math.min(c.maxTokens, c.tokens + elapsed * c.refillPerSec)
  c.lastTick = now
  return c
}

export async function withRetry<T>(
  channel: string,
  fn: () => Promise<T>,
  attempts = 3,
  baseMs = 100,
): Promise<T> {
  let lastErr: unknown

  for (let i = 0; i < attempts; i++) {
    try {
      return await fn()
    } catch (err) {
      lastErr = err
      // Final attempt failed: nothing left to retry, so skip the token and the sleep.
      if (i === attempts - 1) break
      // The first call is always allowed. Every retry after it costs a token.
      // Re-read the channel here so the lazy refill is applied before the check.
      const c = getChannel(channel)
      if (c.tokens < 1) {
        // Budget exhausted: fail fast, do not amplify
        throw lastErr
      }
      c.tokens -= 1
      const jitter = Math.random() * baseMs
      const backoff = baseMs * 2 ** i + jitter
      await new Promise((r) => setTimeout(r, backoff))
    }
  }
  throw lastErr
}

Two things changed beyond the budget itself. The retry now requires a channel name, which forces the caller to think about whose budget they are spending. And the backoff now has jitter — without it, every client retries at the same offsets and the load curve gets the same spiky shape, just shifted in time.
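The effect of jitter is easier to see with the delay calculation pulled out on its own. This is a small sketch that uses the same formula as the helper above. The function name `backoffMs` and the injectable `rand` parameter are assumptions added for illustration and testability.

```typescript
// Delay before the retry that follows failed attempt `i` (0-based):
// baseMs * 2^i, plus up to baseMs of random jitter.
// `rand` must return a value in [0, 1); it defaults to Math.random.
export function backoffMs(
  i: number,
  baseMs = 100,
  rand: () => number = Math.random,
): number {
  const jitter = rand() * baseMs
  return baseMs * 2 ** i + jitter
}

// Two clients that failed at the same moment now wake at different offsets.
const clientA = [0, 1].map((i) => backoffMs(i))
const clientB = [0, 1].map((i) => backoffMs(i))
console.log(clientA, clientB)
```

With `rand` fixed at zero the function reduces to plain exponential backoff, which is the synchronized behaviour the jitter is there to break up.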

What we measured after

We deployed this on a Wednesday. The following Tuesday, the pricing API had another GC pause. The retry budget for that channel hit zero inside three seconds. About 4% of orders saw a one-second error spike. No brown-out. No pager.

A small lesson, but one that has stuck with me ever since: retry without a budget is not resilience, it is a denial-of-service attack on yourself in slow motion. Every retry helper in your codebase should know which channel it is spending from, and every channel should have a bucket that caps sustained retries at its refill rate.

A short reading list

Site Reliability Engineering (the Google SRE book), chapter 22: Addressing Cascading Failures
The Tail at Scale, Dean and Barroso, CACM 2013
Exponential Backoff and Jitter, Marc Brooker, AWS Architecture Blog

If your retry helper does not have a budget, you are one bad afternoon away from this same post-mortem. Add one before the afternoon arrives.
