Retry Storms in Go: When Resilience Becomes an Outage
Retries are supposed to save your system.
In one incident, ours took the system down faster than the failure they were meant to absorb.
The Failure Pattern
An upstream dependency slowed down. Every service layer retried three times:
- API gateway retried
- Service retried
- Repository client retried
One failed request turned into a burst of duplicated traffic. CPU spiked, queue depth exploded, and the upstream never recovered.
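The math compounds at every hop: with three attempts at each of three layers, a single user-facing failure can fan out into as many as 3 × 3 × 3 = 27 calls against an upstream that is already struggling.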
The Buggy Retry Logic
var err error
for i := 0; i < 3; i++ {
    if err = callUpstream(ctx, req); err == nil {
        return nil
    }
    // Fixed interval: every failing caller sleeps the same 100ms, so they
    // all hit the upstream again at the same instant (synchronized herd).
    time.Sleep(100 * time.Millisecond)
}
return err // also wrong: permanent errors were retried just as eagerly
Safer Retry Strategy
Use exponential backoff with jitter, cap the backoff, retry only transient errors, and respect context cancellation.
base := 100 * time.Millisecond
maxBackoff := 2 * time.Second // renamed from max to avoid shadowing the builtin
var err error
for i := 0; i < 4; i++ {
    if err = callUpstream(ctx, req); err == nil {
        return nil
    }
    if !isTransient(err) {
        return err // permanent errors (bad input, auth) won't improve on retry
    }
    // Exponential backoff, capped so late attempts don't wait forever.
    backoff := base * time.Duration(1<<i)
    if backoff > maxBackoff {
        backoff = maxBackoff
    }
    // "Equal jitter": sleep uniformly in [backoff/2, backoff) so callers
    // spread out instead of retrying in lockstep. rand is math/rand.
    jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
    sleep := backoff/2 + jitter
    select {
    case <-time.After(sleep):
    case <-ctx.Done():
        return ctx.Err() // the caller gave up; stop spending attempts
    }
}
return err
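The loop leans on an isTransient helper the snippet never defines. Here is a minimal sketch using only the standard library's errors, net, and context packages; which errors count as retryable is a policy decision for your dependency, not a fixed rule:

// isTransient reports whether an error is worth retrying. This sketch
// treats per-attempt timeouts and network timeouts as transient and
// everything else as permanent; extend it for your upstream's 429/503s.
func isTransient(err error) bool {
    if errors.Is(err, context.DeadlineExceeded) {
        return true // a per-attempt deadline expired; the next try may succeed
    }
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return true
    }
    return false
}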
Rules We Now Use in Production
- Retry only idempotent operations unless protected by idempotency keys (first sketch after this list).
- Never stack retries at every layer.
- Enforce retry budgets so failure paths cannot exceed normal traffic by large factors (second sketch below).
- Expose retry metrics (attempt_count, retry_reason) so storms are visible early (third sketch below).
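For the first rule, an idempotency key lets the server deduplicate a retried write. A minimal sketch over net/http; the Idempotency-Key header is a common convention, and newIdempotentRequest is our illustrative name, not a stdlib API. The key must be created once per logical operation and reused across all of its retries:

// newIdempotentRequest attaches a random idempotency key to a POST so the
// server can collapse duplicates if the same operation is retried.
func newIdempotentRequest(ctx context.Context, url string, body []byte) (*http.Request, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    key := make([]byte, 16)
    if _, err := rand.Read(key); err != nil { // crypto/rand
        return nil, err
    }
    req.Header.Set("Idempotency-Key", hex.EncodeToString(key))
    return req, nil
}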
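For retry budgets, a token bucket works: primary requests earn fractional tokens and each retry spends a whole one, so retries can never exceed a fixed fraction of normal traffic. A sketch with illustrative names (retryBudget is not a library type):

// retryBudget is a token bucket: each primary request deposits a fraction
// of a token, and each retry must withdraw a whole one. With ratio = 0.1,
// retries are capped near 10% of normal traffic.
type retryBudget struct {
    mu     sync.Mutex
    tokens float64
    ratio  float64 // tokens earned per primary request
    burst  float64 // cap on accumulated tokens
}

func (b *retryBudget) OnRequest() {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.tokens = math.Min(b.burst, b.tokens+b.ratio)
}

func (b *retryBudget) AllowRetry() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.tokens < 1 {
        return false // budget exhausted: fail fast instead of amplifying
    }
    b.tokens--
    return true
}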
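And for visibility, any labeled counter will do. A sketch using Prometheus's client_golang, reusing the attempt_count and retry_reason names from the rule above (Prometheus convention would add a _total suffix):

// attemptCount counts retries by reason; a storm shows up as a step
// change in rate(attempt_count[1m]) long before CPU or queues saturate.
var attemptCount = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "attempt_count",
        Help: "Retry attempts performed, labeled by retry reason.",
    },
    []string{"retry_reason"},
)

func init() {
    prometheus.MustRegister(attemptCount)
}

// In the retry loop, before sleeping:
//     attemptCount.WithLabelValues("timeout").Inc()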
What Went Wrong in My Incident
- What alerted first: Upstream latency increased, followed by a sudden request-rate surge.
- What misled us: We assumed traffic growth, not retry amplification, caused the spike.
- What confirmed root cause: Attempt-level metrics showed multiple retry layers multiplying a single failure path.
Retries are a multiplier. Make sure they multiply stability, not pressure.