The outage is over. The retry storm is starting.

A dependency comes back. Failed jobs wake up together. Recovery traffic fights production on the same cluster.

In data platforms, retries become load.

Correct retries can still hurt

The dependency really was down. Waiting made sense. Retrying was not stupid.

Then everything came back at once.

That is the part many systems ignore. A retry policy is not just local error handling. It is a platform-wide traffic policy.

AWS has the cleanest math: five layers of services, each making up to three attempts per call, can turn one request at the top into 3^5 = 243 calls against the dependency at the bottom.
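
The arithmetic is easy to reproduce. A back-of-the-envelope sketch in Python, assuming every layer makes the same number of attempts and every attempt fails:

    # Worst-case retry amplification across stacked layers.
    # Assumes each layer makes the same fixed number of attempts per incoming
    # call and every attempt fails, so each layer multiplies the load below it.

    def amplification(layers: int, attempts_per_layer: int) -> int:
        """Calls hitting the dependency below the lowest layer, per top-level request."""
        return attempts_per_layer ** layers

    print(amplification(layers=5, attempts_per_layer=3))  # 243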

What goes wrong

Retry storms create several failure modes at once:

  • compute amplification
  • queue starvation
  • duplicate writes
  • delayed recovery for critical workloads
  • noisy dashboards that make the platform look unstable after the root outage ended

The worst version is a retry that succeeds technically but corrupts downstream state because the write path was not idempotent.
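
A minimal sketch of one way to close that gap, using an explicit idempotency key on the write path; the table, key format, and SQLite backend are illustrative, not a prescription:

    # Sketch: a retried write keyed by an explicit idempotency key.
    # Replaying the same write becomes a no-op instead of a duplicate row.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE job_results (
            idempotency_key TEXT PRIMARY KEY,  -- e.g. job id + partition + run window
            payload         TEXT NOT NULL
        )
    """)

    def write_result(key: str, payload: str) -> None:
        # INSERT OR IGNORE: the same key can be written any number of times safely.
        conn.execute(
            "INSERT OR IGNORE INTO job_results (idempotency_key, payload) VALUES (?, ?)",
            (key, payload),
        )
        conn.commit()

    write_result("job-42/part-7", "ok")
    write_result("job-42/part-7", "ok")  # the retry lands on the same key
    print(conn.execute("SELECT COUNT(*) FROM job_results").fetchone()[0])  # 1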

Better retry policy

Retries should be selective. Queue pressure, temporary transport failures, cluster capacity problems, and ambiguous commit states can deserve another attempt. Deterministic data errors usually do not.
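
One way to make that selectivity explicit is to classify failures before any retry fires. The exception names below are hypothetical stand-ins for whatever your client libraries actually raise:

    # Sketch: decide whether a failure is worth retrying at all.
    # Hypothetical exception types; map them to your real client errors.

    class TransientTransportError(Exception): ...   # connection reset, timeout
    class ClusterCapacityError(Exception): ...      # queue full, no executors free
    class AmbiguousCommitError(Exception): ...      # the write may or may not have landed
    class DataValidationError(Exception): ...       # schema mismatch, bad record

    # Ambiguous commits are only safe to retry when the write path is
    # idempotent; otherwise quarantine them instead.
    RETRYABLE = (TransientTransportError, ClusterCapacityError, AmbiguousCommitError)

    def should_retry(exc: Exception) -> bool:
        # Deterministic data errors fail the same way every time; retrying them
        # only adds load and delays the real fix.
        if isinstance(exc, DataValidationError):
            return False
        return isinstance(exc, RETRYABLE)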

Then add platform controls:

  • isolate ad-hoc and scheduled workloads
  • give critical jobs separate queues
  • cap retry concurrency (see the sketch after this list)
  • make idempotency explicit
  • quarantine ambiguous writes when correctness is uncertain
  • autoscale for recovery traffic, not just steady state
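
A minimal sketch of the concurrency cap combined with jittered backoff, using an in-process semaphore as a stand-in for real scheduler-level admission control; the limit and delays are placeholders, not recommendations:

    # Sketch: retries draw from a capped pool and back off with full jitter,
    # so recovery traffic cannot saturate the cluster on its own.
    import random
    import threading
    import time

    # At most 8 concurrent retries in this worker; a real platform would
    # enforce the cap in the scheduler or queue, not per process.
    RETRY_SLOTS = threading.BoundedSemaphore(8)

    def run_with_capped_retries(task, max_retries: int = 3, base_delay: float = 1.0):
        try:
            return task()                     # the first attempt is normal traffic
        except Exception:
            pass                              # a real version would check should_retry(exc) here
        for attempt in range(1, max_retries + 1):
            # Full jitter: spread retries out so recovered jobs do not wake up together.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
            with RETRY_SLOTS:                 # retries compete for a capped pool of slots
                try:
                    return task()
                except Exception:
                    if attempt == max_retries:
                        raise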

The real end of an outage

An outage is not over when the dependency comes back.

It is over when the retry wave has stopped hurting production, scheduled jobs have caught up, and the platform has not created a second incident out of its own recovery traffic.

Retries are reliability tools. They are also cost tools. Treat them like both.