Why our retry budget was actually a load amplifier
A post-mortem on the day a thirty-line retry helper turned a one-second blip into a four-hour brown-out. The maths is not subtle once you draw it.
Field notes on distributed systems, resilient services, and the craft of software architecture — written by Alex Chen from the workshop.
Two issues a month. New essays, the post-mortems I’m reading, and the architectural sketches that didn’t make it into the writing. No tracking, no upsells.
TypeScript 5.5 added inferred type predicates. Most of what you read about it understates how much pattern-matching code you can now delete.
A small command-line tool, written from scratch, that does one useful thing well. No dependencies beyond the standard library and clap.
A context.WithCancel that is never cancelled can leave behind a goroutine that never exits. The pattern looks fine. The leak shows up as a slow memory climb over days.
The fastest path to a faster service is almost always removal, not addition. A short report on a 41% p99 reduction we got by deleting two layers we no longer needed.
Adding an idempotency-key header is easy. Making the implementation actually correct under retry, partial failure, and concurrent submission takes more care than the API docs let on.
Most Python async code handles success well and failure poorly. Here are the two cancellation bugs I have now seen in five different production codebases.
A short introductory note from the editor. The kind of post you can skip, but here is what to expect if you stay.