Appearance
There is a Go bug I have now fixed in four different production codebases. It does not look like a bug; it looks like idiomatic, defensive code that any reviewer would wave through. It only becomes visible when you look at the goroutine profile a week into a deploy and notice that the goroutine count is climbing in a straight line.
The shape of it is this.
The pattern
You have a handler that wants its own cancellation signal layered on top of the request's context. The standard recipe:
```go
func handle(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithCancel(r.Context())
	// ... do work with ctx ...
	cancel()
}
```

You call cancel() at the end. The reviewer sees it. Everyone moves on.
Now consider the version that grew over six months:
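```go
func handle(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithCancel(r.Context())
	result, err := doWork(ctx)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return // ← cancel() not called
	}
	json.NewEncoder(w).Encode(result)
	cancel()
}
```

Spot it? On the error path, cancel is never invoked. The context package documentation is explicit about this: cancelling a context releases the resources associated with it, so code should call cancel as soon as the work running in that context completes. Skip the call and the derived context stays registered with its parent until the parent itself is cancelled. go vet's lostcancel check warns about exactly this. But vet is a lint, and lints get ignored in CI for the usual reasons.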
The fix is defer cancel(), and the fix is universal. There is no situation where you should not be deferring it:
```go
func handle(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithCancel(r.Context())
	defer cancel()
	// ... go ahead, return wherever you like ...
}
```

So far this is standard advice. Here is where it gets interesting.
The version that defeats defer cancel()
The leak we keep finding is not the missing defer. It is this:
```go
func (s *Server) startBackgroundJob() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		defer cancel()
		for {
			select {
			case <-ctx.Done():
				return
			case <-time.After(30 * time.Second):
				s.doWork(ctx)
			}
		}
	}()
	s.jobs = append(s.jobs, cancel)
}
```

The author thought: "I'll store the cancel so Shutdown can stop the worker. The defer cancel() covers the panic case. We're good." It is approximately the most reasonable thing you can write.
The leak is subtle: s.jobs is appended to, but for about 80% of the lifecycle nothing ever reads it. The author wrote a Shutdown method that iterates s.jobs, but Shutdown is only called when the process is exiting anyway. Every time the server is reloaded, hot-reconfigured, or has its background-job set rebuilt, the old cancel functions are dropped on the floor with the rest of the old Server value. Dropping a cancel function does not cancel anything, and a goroutine blocked in select is not garbage-collected just because nobody can reach its context any more. The goroutines those functions would have cancelled live forever.
You can see this in runtime.NumGoroutine(): a sawtooth that never quite returns to baseline. After a few hundred config reloads, the process has a couple thousand orphaned goroutines, each pinning a *Server and everything its closure captured.
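The effect is easy to reproduce in isolation. The sketch below is a cut-down model of the pattern above, not the production code: the worker loop is reduced to waiting on ctx.Done(), and reload and leakedAfter are hypothetical helpers that stand in for a config reload and for the measurement.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
)

// Server keeps the cancel funcs, as in the pattern described above.
type Server struct {
	jobs []context.CancelFunc
}

func (s *Server) startBackgroundJob() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		defer cancel()
		<-ctx.Done() // stand-in for the real select/ticker loop
	}()
	s.jobs = append(s.jobs, cancel)
}

// reload simulates n config reloads (hypothetical helper). Each iteration
// builds a fresh Server and drops the previous one without calling its
// stored cancel funcs.
func reload(n int) *Server {
	var s *Server
	for i := 0; i < n; i++ {
		s = &Server{}
		s.startBackgroundJob()
	}
	return s
}

// leakedAfter reports how many extra goroutines exist after n reloads.
func leakedAfter(n int) int {
	before := runtime.NumGoroutine()
	reload(n)
	runtime.GC() // collecting the dropped Servers does not stop their goroutines
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines left behind by 100 reloads:", leakedAfter(100))
}
```

Each reload should leave its worker behind, so the reported number grows with the reload count, and the explicit GC makes no difference to it.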
The fix nobody writes the first time
The fix is to stop tying lifecycle to "I will remember to call cancel". Tie it to a sync.WaitGroup and a single owning context:
```go
type Server struct {
	rootCtx    context.Context    // parent of every background job
	rootCancel context.CancelFunc // the only cancel func anyone has to remember
	wg         sync.WaitGroup     // counts every background goroutine
}

func NewServer() *Server {
	ctx, cancel := context.WithCancel(context.Background())
	return &Server{rootCtx: ctx, rootCancel: cancel}
}

func (s *Server) startBackgroundJob() {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		ticker := time.NewTicker(30 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-s.rootCtx.Done():
				return
			case <-ticker.C:
				s.doWork(s.rootCtx)
			}
		}
	}()
}

// Shutdown cancels every job and blocks until all of them have returned.
func (s *Server) Shutdown() {
	s.rootCancel()
	s.wg.Wait()
}
```

There is now exactly one cancel function, owned by the server. There is one wait group that knows about every background goroutine. Reconfiguration that wants to stop the workers stops all of them and waits for them to exit. There is no []CancelFunc to forget about.
This is a small change but it is structural, not local — you cannot review it into existence by squinting at one handler. The context package's design rewards composition; the failure modes reward ownership. They are not the same property.
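To make the reconfiguration case concrete, here is a minimal sketch of a reload built on the single-owner Server. Reload and leakedAfterReloads are hypothetical helpers, the interval parameter is added only to keep the demo fast, and the work itself is left out. The assumption is that a reload is allowed to block until the old workers have exited.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"sync"
	"time"
)

type Server struct {
	rootCtx    context.Context
	rootCancel context.CancelFunc
	wg         sync.WaitGroup
}

func NewServer() *Server {
	ctx, cancel := context.WithCancel(context.Background())
	return &Server{rootCtx: ctx, rootCancel: cancel}
}

func (s *Server) startBackgroundJob(interval time.Duration) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-s.rootCtx.Done():
				return
			case <-ticker.C:
				// real work would run here with s.rootCtx
			}
		}
	}()
}

func (s *Server) Shutdown() {
	s.rootCancel()
	s.wg.Wait()
}

// Reload (hypothetical) stops the old server's workers before building the
// replacement, so no worker outlives the Server that owns it.
func Reload(old *Server, workers int) *Server {
	if old != nil {
		old.Shutdown()
	}
	s := NewServer()
	for i := 0; i < workers; i++ {
		s.startBackgroundJob(10 * time.Millisecond)
	}
	return s
}

// leakedAfterReloads reports how many extra goroutines remain after n
// reloads followed by a final Shutdown.
func leakedAfterReloads(n, workers int) int {
	before := runtime.NumGoroutine()
	var s *Server
	for i := 0; i < n; i++ {
		s = Reload(s, workers)
	}
	if s != nil {
		s.Shutdown()
	}
	// wg.Wait returns once every worker has called Done, which is slightly
	// before the goroutine has fully exited, so poll briefly before counting.
	deadline := time.Now().Add(time.Second)
	for runtime.NumGoroutine() > before && time.Now().Before(deadline) {
		time.Sleep(time.Millisecond)
	}
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("extra goroutines after 50 reloads:", leakedAfterReloads(50, 4))
}
```

The design choice worth noticing is that Reload never touches an individual worker. It only talks to the owner, and the owner already knows how to stop and wait for everything it started.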
Catching it
I keep two checks in CI now and one runtime alert. The CI checks are:
```sh
# 1. go vet's lostcancel analyzer: flags return paths that skip cancel
go vet ./...
# 2. staticcheck's context checks, e.g. SA1012 (nil context passed) and SA1029 (bad context.WithValue key)
staticcheck ./...
```

And in production I track runtime.NumGoroutine() as a gauge in the metrics pipeline. If it ever increases monotonically over 24 hours, page someone. The graph tells you exactly which deploy did it, and in my experience the leak has always been within ~200 lines of the offending merge.
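The alert rule itself is small enough to sketch. The metrics pipeline is not specified here, so the example below uses the standard library's expvar package as a stand-in gauge, and growsMonotonically is a hypothetical name for the check an alerting system would run over a 24-hour window of samples. The sample values are made up for illustration.

```go
package main

import (
	"expvar"
	"fmt"
	"runtime"
)

// growsMonotonically reports whether the samples never decrease and end
// higher than they started. A flat series does not count as growth.
func growsMonotonically(samples []int) bool {
	if len(samples) < 2 {
		return false
	}
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			return false
		}
	}
	return samples[len(samples)-1] > samples[0]
}

func main() {
	// Expose the live goroutine count as a gauge. expvar is only a stand-in
	// for whatever metrics client the service actually uses.
	expvar.Publish("goroutines", expvar.Func(func() any {
		return runtime.NumGoroutine()
	}))

	healthy := []int{40, 55, 38, 61, 42} // rises and falls with load
	leaking := []int{40, 52, 52, 67, 81} // never comes back down

	fmt.Println(growsMonotonically(healthy), growsMonotonically(leaking))
	// prints "false true"
}
```

A healthy service sheds goroutines when load drops, so even one sample that dips below its predecessor clears the condition. Only a count that never comes back down trips it.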
The tools have been telling us about this for years. We just keep building structures that work around them. The pattern to internalize is one owner per cancel, one place to wait. Anything else is hope.