Performance·2026-05-18·5 min read

How we cut p99 by deleting code

The fastest path to a faster service is almost always removal, not addition. A short report on a 41% p99 reduction we got by deleting three layers we no longer needed.

The fastest path to a faster service is almost always removal, not addition. This is not a deep insight, but it is one we relearn every two or three quarters when a profiler nudges us to remember.

Last sprint our checkout API's p99 latency was 480ms, with a p50 of 90ms. The shape of the histogram — a long, fat right tail — told us we were not bottlenecked on a single hot path. We were paying small taxes everywhere, and the tail was where they all compounded. We cut p99 to 285ms over four days. We did it by deleting code.
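One way to see why small per-hop taxes show up in the tail rather than the median: if each hop independently has a small chance of a slow response, the chance that a request hits at least one slow hop grows with the number of hops. The sketch below works through that arithmetic. The 1% per-hop slow rate, the hop counts, and the independence assumption are all illustrative, not measurements from our system.

```go
package main

import (
	"fmt"
	"math"
)

// slowRequestProb returns the probability that a request touching `hops`
// hops sees at least one slow response, when each hop is slow with
// probability perHop. Treating hops as independent is a simplification.
func slowRequestProb(perHop float64, hops int) float64 {
	return 1 - math.Pow(1-perHop, float64(hops))
}

func main() {
	const perHop = 0.01 // hypothetical: each hop is slow 1% of the time
	for _, hops := range []int{1, 3, 6} {
		fmt.Printf("%d hops: %.2f%% of requests hit a slow hop\n",
			hops, 100*slowRequestProb(perHop, hops))
	}
}
```

Under these assumptions the slow fraction goes from 1.00% with one hop to 2.97% with three and 5.85% with six. The median barely moves, but once more than 1% of requests include a slow hop, p99 is made up entirely of them.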

What was there before

The request path looked like this:

text
client → CDN → API gateway → auth proxy → orchestrator → orders service → DB

                                              ├── feature flag service
                                              ├── pricing cache
                                              └── address normalizer

Three of those hops had been added at three different times by three different people for three plausible reasons. The auth proxy was a sidecar that pre-validated JWTs before the orchestrator parsed them. The orchestrator was a thin Node.js process that fanned out to the dependent services and aggregated responses. The address normalizer was a small service that ran USPS-style cleanup on the shipping address.

None of them were "wrong." Each was reasonable at the time. But they had quietly stopped justifying their cost.

The auth proxy

The original argument for the auth proxy was that we could centralize JWT validation and reject malformed tokens before they reached any of the services. This made sense when we had eight services consuming auth. We now have two. The proxy was adding a hop, deserializing the request, validating the JWT, and re-serializing — for the benefit of two downstream consumers that were already doing JWT validation themselves as defence-in-depth.

I deleted the proxy and pushed the validation into a shared library that both services already imported. p99 dropped by 38ms. p50 dropped by 6ms.

The library is a hundred lines. Here is the heart of it:

go
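// auth/verify.go
package auth

import (
    "errors"

    "github.com/golang-jwt/jwt/v5"
)

// Claims is the token payload both services rely on.
// UserID shadows RegisteredClaims.Subject during JSON decoding,
// because both map to "sub" and the less nested field wins.
type Claims struct {
    UserID string `json:"sub"`
    Tier   string `json:"tier"`
    jwt.RegisteredClaims
}

// Verify parses token, checks its signature against publicKey, and
// returns the claims. It rejects any token not signed with RS256.
func Verify(token string, publicKey any) (*Claims, error) {
    parsed, err := jwt.ParseWithClaims(token, &Claims{},
        func(t *jwt.Token) (any, error) {
            // Pin the algorithm so a token cannot choose its own signing method.
            if t.Method.Alg() != "RS256" {
                return nil, errors.New("unexpected signing method")
            }
            return publicKey, nil
        })
    if err != nil {
        return nil, err
    }
    claims, ok := parsed.Claims.(*Claims)
    if !ok || !parsed.Valid {
        return nil, errors.New("invalid claims")
    }
    return claims, nil
}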

That is everything the sidecar was actually doing. The rest of it was a few hundred lines of glue.
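For context, here is roughly how a service might call such a verifier in-process instead of relying on a sidecar. This is a minimal sketch using only the standard library. `VerifyFunc`, `RequireAuth`, and `bearerToken` are hypothetical names, and the verifier is passed in as a function so the sketch does not depend on the JWT library or on how our services actually wire it up.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// VerifyFunc is a hypothetical stand-in for a token verifier, for example
// a Verify function with the public key already bound. It returns the subject.
type VerifyFunc func(token string) (subject string, err error)

// ctxKey is the private context key under which the subject is stored.
type ctxKey struct{}

// bearerToken extracts the token from an "Authorization: Bearer <token>" value.
func bearerToken(header string) (string, bool) {
	const prefix = "Bearer "
	if len(header) <= len(prefix) || !strings.EqualFold(header[:len(prefix)], prefix) {
		return "", false
	}
	return header[len(prefix):], true
}

// RequireAuth validates the token in-process and answers 401 before the
// wrapped handler runs if the token is missing or invalid.
func RequireAuth(verify VerifyFunc, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token, ok := bearerToken(r.Header.Get("Authorization"))
		if !ok {
			http.Error(w, "missing bearer token", http.StatusUnauthorized)
			return
		}
		subject, err := verify(token)
		if err != nil {
			http.Error(w, "invalid token", http.StatusUnauthorized)
			return
		}
		ctx := context.WithValue(r.Context(), ctxKey{}, subject)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	// Toy verifier for the demo only; a real one would check the signature.
	verify := func(token string) (string, error) {
		if token != "good" {
			return "", errors.New("bad token")
		}
		return "user-1", nil
	}
	h := RequireAuth(verify, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "hello %v", r.Context().Value(ctxKey{}))
	}))

	// A valid token, an invalid token, and no header at all.
	for _, auth := range []string{"Bearer good", "Bearer bad", ""} {
		req := httptest.NewRequest(http.MethodGet, "/checkout", nil)
		if auth != "" {
			req.Header.Set("Authorization", auth)
		}
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, req)
		fmt.Println(rec.Code)
	}
}
```

The point of the shape is that validation becomes a function call inside the request handler's own process, so there is no extra hop and no second parse of the request body.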

The orchestrator

The orchestrator was a Node.js service whose only job was to call three downstream services in parallel and merge the responses into a single JSON payload. We had built it to keep service-call complexity out of the client.

The trouble was that all three downstreams were now reachable from the orders service directly, and the orders service was already calling two of the three. The orchestrator was making redundant requests and serving as a serialization stop.

I deleted the orchestrator and moved its three-way fanout into the orders service, where it could share the request context and skip the duplicate calls. The merge happened inline, in Go, with no extra allocation.

go
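// orders/checkout.go: fanout merged into the request handler
package orders

import (
    "context"
    "errors"
    "sync"
)

// checkoutResp is the merged payload the orchestrator used to assemble.
// Order, Pricing, CheckoutRequest, and Handler are defined elsewhere in the package.
type checkoutResp struct {
    Order   *Order
    Pricing *Pricing
    Flags   map[string]bool
}

// Checkout runs the three downstream calls concurrently and merges the results.
func (h *Handler) Checkout(ctx context.Context, req *CheckoutRequest) (*checkoutResp, error) {
    var (
        order            *Order
        pricing          *Pricing
        flags            map[string]bool
        oErr, pErr, fErr error
    )

    // Each goroutine writes only its own variables, so no mutex is needed;
    // wg.Wait() orders those writes before the reads below.
    var wg sync.WaitGroup
    wg.Add(3)
    go func() { defer wg.Done(); order, oErr = h.store.Place(ctx, req) }()
    go func() { defer wg.Done(); pricing, pErr = h.pricing.For(ctx, req.SKUs) }()
    go func() { defer wg.Done(); flags, fErr = h.flags.For(ctx, req.UserID) }()
    wg.Wait()

    // errors.Join (Go 1.20+) returns nil only when all three calls succeeded.
    if err := errors.Join(oErr, pErr, fErr); err != nil {
        return nil, err
    }
    return &checkoutResp{Order: order, Pricing: pricing, Flags: flags}, nil
}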

This was the largest single win. p99 dropped by 92ms. The orchestrator had been adding two serialization round-trips on every request, and removing them removed both the latency and a class of partial-failure bugs we had been quietly papering over.
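One partial-failure detail is worth spelling out. With a plain `sync.WaitGroup`, a failed call does not stop its siblings, so the handler waits for the slowest of them before it can return the error. A common variation cancels a shared context on the first failure. The sketch below shows that pattern using only the standard library. `fanout`, `runDemo`, and the toy tasks are hypothetical, and this is not the handler we shipped.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// fanout runs every task concurrently with a shared context. The first
// task to fail cancels that context so the others can stop early.
// It returns the first error recorded, or nil if all tasks succeed.
func fanout(ctx context.Context, tasks ...func(context.Context) error) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for _, task := range tasks {
		wg.Add(1)
		go func(task func(context.Context) error) {
			defer wg.Done()
			if err := task(ctx); err != nil {
				// Record only the first failure, then signal the siblings.
				once.Do(func() {
					firstErr = err
					cancel()
				})
			}
		}(task)
	}
	wg.Wait()
	return firstErr
}

// runDemo pairs a slow task with one that fails immediately and reports
// the error text and whether fanout returned in under a second.
func runDemo() (string, bool) {
	failFast := func(ctx context.Context) error {
		return errors.New("pricing unavailable")
	}
	slow := func(ctx context.Context) error {
		select {
		case <-time.After(5 * time.Second): // simulated slow dependency
			return nil
		case <-ctx.Done(): // stops as soon as a sibling fails
			return ctx.Err()
		}
	}

	start := time.Now()
	err := fanout(context.Background(), slow, failFast)
	return err.Error(), time.Since(start) < time.Second
}

func main() {
	msg, fast := runDemo()
	fmt.Println(msg, fast)
}
```

Cancellation only helps calls that have not finished yet. It does not undo work that has already committed, such as a placed order, so whether a write belongs in the same fanout as the reads is a separate design question.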

The address normalizer

This one was the most painful to remove because it was real, useful code. It cleaned up shipping addresses to USPS format before they reached the warehouse. The catch was that the warehouse's intake system had quietly added its own normalizer two years earlier, and the two were doing the same work twice.

We deleted ours. p99 dropped by another 65ms, which was the in-flight serialization cost we had been paying for a no-op.
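The second pass was a no-op because this kind of normalization is idempotent: normalizing an already-normalized address returns it unchanged. The toy normalizer below illustrates the property. Its rules (uppercasing, collapsing whitespace, a three-entry suffix table) are hypothetical stand-ins and far simpler than real USPS-style cleanup.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixes is a tiny, illustrative abbreviation table, not the real USPS list.
var suffixes = map[string]string{
	"STREET": "ST",
	"AVENUE": "AVE",
	"ROAD":   "RD",
}

// normalize uppercases the address, collapses runs of whitespace, and
// abbreviates known street suffixes.
func normalize(addr string) string {
	words := strings.Fields(strings.ToUpper(addr))
	for i, w := range words {
		if abbr, ok := suffixes[w]; ok {
			words[i] = abbr
		}
	}
	return strings.Join(words, " ")
}

func main() {
	once := normalize("  123  main street ")
	twice := normalize(once)
	fmt.Println(once)          // 123 MAIN ST
	fmt.Println(once == twice) // true
}
```

When a downstream system already applies the same rules, the upstream pass changes nothing the downstream would not have produced anyway, and only its cost remains.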

What we did not do

We did not add a cache. We did not add a load balancer. We did not rewrite anything in a faster language. We did not even change a single algorithm. We removed three hops from the request path and let the existing code run a bit closer to the metal.
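The per-layer numbers reported above add up to the headline figure, which is a quick sanity check worth doing. A small sketch of the arithmetic, with `totalSavedMs` as a hypothetical helper:

```go
package main

import "fmt"

// totalSavedMs sums the per-layer p99 savings, in milliseconds.
func totalSavedMs(savings map[string]int) int {
	total := 0
	for _, ms := range savings {
		total += ms
	}
	return total
}

func main() {
	const beforeMs = 480 // p99 before the deletions
	savings := map[string]int{
		"auth proxy":         38,
		"orchestrator":       92,
		"address normalizer": 65,
	}
	saved := totalSavedMs(savings)
	afterMs := beforeMs - saved
	fmt.Printf("saved %dms, p99 %dms -> %dms (%.0f%% lower)\n",
		saved, beforeMs, afterMs, 100*float64(saved)/beforeMs)
}
```

This prints `saved 195ms, p99 480ms -> 285ms (41% lower)`. Percentile deltas do not add up in general; they do here because each saving was measured as the p99 change after the corresponding deletion.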

The before-and-after p99 trace, in milliseconds, shrank from this:

text
  ▁▂▄▆█▆▄▂▁          ▁▂▃▄▅▆▇█▇▆▄▃▂▁
  0    100     200    300    400    500   ms
       ^^^             ^^^      ^^^
       proxy           orch     normalizer

To this:

text
  ▁▂▄▇█▇▄▂▁
  0    100     200    300                   ms

The principle

The principle behind this is older than I am. Every layer in your request path costs you a serialization, a network hop, a re-validation, and a slice of error-handling surface area. Layers earn the right to exist by doing something the layers around them cannot. When the surrounding system absorbs that responsibility, the layer should leave.

Most architectures grow by accretion. The fastest way to make yours faster is to look at what has accreted, and ask, of each thing, whether it would be added today if it were not already there. The ones that would not be should go. Ours took 41% of our p99 with them.

For a longer treatment of this idea, see Carlos Becker's piece on simplification and the chapter on "Removal as a refactoring" in Kent Beck's Tidy First?. Both are short and worth your time.