Retries Are Not a Fix. They Are a Liability.

Most retry mechanisms in background jobs are unsafe by default. This is a quick way to find out if yours is one of them.

Before you read, run this test

If you’re using background jobs, queues, or retries, answer this:

  • Can the same task run twice without breaking anything?

  • Can you tell if the task already completed before retrying it?

  • Does every retry carry the correct user / tenant context?

  • Do you mark tasks as done only after confirming the outcome?

  • Can you trace what happened across retries?

If you answered “no” to 2 or more: Your retries are unsafe.

If you answered “no” to 3 or more: Your system will eventually corrupt data.

Why this test exists

We added retries to a system recently.

Error rates dropped. Dashboards looked better.

But a few days later:

  • duplicate emails started showing up

  • duplicate operations increased

  • and in one case, a user from tenant A received an email meant for tenant B

Nothing was failing anymore.

But nothing was correct either.

Why retry mechanisms fail in real systems

A retry is not:

“try again”

A retry is:

“run the same operation again, even if we don’t know what already happened”

That uncertainty is where systems break.

What actually happens in distributed systems

In real systems:

  • API call succeeds but response is lost

  • Database write happens but worker crashes

  • External service processes the request, but the call times out before the response arrives

Your system sees a failure.

Reality has already moved forward.

Now the retry runs.

The system retries because it thinks nothing happened. Reality says otherwise.
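To make the lost-response case concrete, here is a minimal sketch. `FakeProvider` and `chargeWithBlindRetry` are hypothetical names invented for this illustration, and the provider is an in-memory stand-in, not a real payment client.

```typescript
// Hypothetical in-memory payment provider, used only to illustrate the failure mode.
class FakeProvider {
  charges: number[] = []
  private dropNextResponse = true

  // Records the charge first, then "loses" the first response to mimic a timeout.
  charge(amountCents: number): string {
    this.charges.push(amountCents)
    if (this.dropNextResponse) {
      this.dropNextResponse = false
      throw new Error('timeout: response lost')
    }
    return `charge_${this.charges.length}`
  }
}

// Blind retry: treats every error as "nothing happened" and simply calls again.
function chargeWithBlindRetry(provider: FakeProvider, amountCents: number, attempts: number): string {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return provider.charge(amountCents)
    } catch (err) {
      lastError = err
    }
  }
  throw lastError
}

const provider = new FakeProvider()
const receipt = chargeWithBlindRetry(provider, 500, 3)
// The caller sees a single success, but provider.charges now holds two entries.
```

From the caller's side this looks like a retry that worked. From the provider's side the customer was charged twice.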

This is how systems quietly break

  • Payment gets charged twice

  • Email gets sent multiple times

  • Inventory gets updated incorrectly

Retries don’t fix failures.

They repeat operations that may have already succeeded.

It gets worse in multi-tenant systems

We reviewed a system where a user from tenant A received an email meant for tenant B.

No errors. No alerts.

What happened:

  • Job had tenant context

  • Worker partially failed

  • Retry ran without proper isolation

  • Same job executed with wrong tenant context

Now it wasn’t just a duplicate.

It was a data leak.

A retry without tenant isolation is not a retry. It’s a cross-tenant bug.

The real mistake

Most systems assume:

failure means nothing happened

That assumption is wrong.

The most dangerous state is:

something happened, but you don’t know if it did

Retries operate blindly in that state.
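One way to stop treating "unknown" as "failed" is to give it its own state. The sketch below is an assumption-heavy simplification: the `CallError` shape and the three error kinds are hypothetical, and real clients expose this information in different ways.

```typescript
// Three-valued outcome: "unknown" is its own state, not a kind of failure.
type Outcome = 'succeeded' | 'failed' | 'unknown'

// Hypothetical error shape, invented for this sketch.
interface CallError {
  kind: 'connection_refused' | 'timeout' | 'connection_reset'
}

// A refused connection means the request never reached the service.
// A timeout or reset says nothing about whether it was processed.
function classify(error: CallError | null): Outcome {
  if (error === null) return 'succeeded'
  if (error.kind === 'connection_refused') return 'failed'
  return 'unknown'
}

type NextStep = 'done' | 'retry' | 'check_then_retry'

// Only a definite failure is safe to retry without looking first.
function nextStep(outcome: Outcome): NextStep {
  switch (outcome) {
    case 'succeeded':
      return 'done'
    case 'failed':
      return 'retry'
    case 'unknown':
      return 'check_then_retry'
  }
}
```

The point is the third branch: when the outcome is unknown, the job should check what already happened before it runs the operation again.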

How to design safe retries in distributed systems

Retries are fine.

Blind retries are not.

1. Make operations idempotent

If the same job runs twice, the result should not change.

No idempotency → no safe retries.
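A small sketch of the difference, using an inventory update. The `Map` and `Set` are in-memory stand-ins for database tables, and the function names and operation ID format are assumptions made for the example.

```typescript
// In-memory stand-ins for a stock table and a processed-operations table.
const stock = new Map<string, number>([['sku-1', 10]])
const appliedOperations = new Set<string>()

// Not idempotent: every execution changes the result.
function decrementStock(sku: string, quantity: number): void {
  stock.set(sku, (stock.get(sku) ?? 0) - quantity)
}

// Idempotent: the operation ID turns a repeat execution into a no-op.
function decrementStockOnce(operationId: string, sku: string, quantity: number): void {
  if (appliedOperations.has(operationId)) return
  stock.set(sku, (stock.get(sku) ?? 0) - quantity)
  appliedOperations.add(operationId)
}

decrementStockOnce('order-42:reserve', 'sku-1', 2)
decrementStockOnce('order-42:reserve', 'sku-1', 2) // retry of the same operation
// stock.get('sku-1') is 8, not 6
```

In a real database the check, the update, and the recording of the operation ID would need to happen in one transaction, otherwise a crash between them reopens the gap.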

2. Treat retries as duplicate executions

Not second chances.

Design every job assuming:

it may run more than once

3. Don’t mark success too early

Only mark a job complete after:

the side effect is confirmed
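As a rough sketch of that rule, assuming the side effect can be checked after the fact: `perform` and `confirm` are hypothetical hooks, and how a real system confirms an outcome (a provider lookup, a row in a table) depends entirely on the side effect.

```typescript
type JobStatus = 'pending' | 'completed'

interface JobRecord {
  id: string
  status: JobStatus
  attempts: number
}

// Hypothetical hooks: perform() triggers the side effect, confirm() checks that it took effect.
interface Effect {
  perform(): void
  confirm(): boolean
}

function runJob(job: JobRecord, effect: Effect): JobRecord {
  job.attempts += 1
  // If an earlier attempt already took effect, only the bookkeeping is missing.
  if (!effect.confirm()) {
    effect.perform()
  }
  // Mark complete only once the outcome is observable, not when perform() returns.
  if (effect.confirm()) {
    job.status = 'completed'
  }
  return job
}
```

A job that stays `pending` is picked up by the next retry, which first checks whether the previous attempt actually landed.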

4. Preserve context explicitly

Especially in multi-tenant systems:

  • tenant ID must be part of job identity

  • never rely on implicit or global state

  • retries must carry full context
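These three rules can be sketched as follows. The `EmailJob` shape, the `userTenants` lookup, and the key format are all assumptions for illustration. A real system would check ownership against its own data store.

```typescript
// Everything the handler needs travels inside the payload, including the tenant.
interface EmailJob {
  tenantId: string
  userId: string
  template: string
}

// Hypothetical user directory: user ID to owning tenant.
const userTenants = new Map<string, string>([
  ['user-1', 'tenant-a'],
  ['user-2', 'tenant-b'],
])

// Tenant ID is part of the job identity, so two tenants can never share a key.
function jobKey(job: EmailJob): string {
  return `${job.tenantId}:${job.template}:${job.userId}`
}

// The handler reads context only from the payload and refuses to run on a mismatch.
function handleEmailJob(job: EmailJob, send: (userId: string, tenantId: string) => void): void {
  const owner = userTenants.get(job.userId)
  if (owner !== job.tenantId) {
    throw new Error(`tenant mismatch for job ${jobKey(job)}`)
  }
  send(job.userId, job.tenantId)
}
```

Because the handler never reads a "current tenant" from shared state, a retry on a different worker sees exactly the same context as the first attempt.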

A safer retry pattern

If you take one thing from this post, use this:

// 1. Check if operation already exists
// 2. If yes → exit
// 3. Perform side effect
// 4. Store result with unique operation ID
// 5. Mark job complete

This gives you:

  • safe retries

  • crash recovery

  • duplicate protection
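The five steps above could be sketched like this. `OperationStore` and `runOnce` are hypothetical names, and the store is an in-memory stand-in for a durable table with a unique constraint on the operation ID.

```typescript
interface OperationRecord<T> {
  operationId: string
  result: T
}

// In-memory stand-in for a durable table keyed by operation ID.
class OperationStore<T> {
  private records = new Map<string, OperationRecord<T>>()

  find(operationId: string): OperationRecord<T> | undefined {
    return this.records.get(operationId)
  }

  save(record: OperationRecord<T>): void {
    this.records.set(record.operationId, record)
  }
}

function runOnce<T>(store: OperationStore<T>, operationId: string, sideEffect: () => T): T {
  const existing = store.find(operationId) // 1. check if the operation already exists
  if (existing) return existing.result     // 2. if yes, exit with the stored result
  const result = sideEffect()              // 3. perform the side effect
  store.save({ operationId, result })      // 4. store the result under the operation ID
  return result                            // 5. the caller can now mark the job complete
}
```

With a real database, the unique constraint is what protects against two workers passing step 1 at the same time.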

Where most NestJS setups go wrong

This is common with Bull/BullMQ:

@Process()
async handle(job: Job) {
  // Every attempt calls charge() again, even if an earlier attempt already succeeded.
  await this.paymentService.charge(job.data)
}

Looks fine. Breaks with retries.

Safer version:

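@Process()
async handle(job: Job) {
  // Skip the side effect if an earlier attempt already recorded a result.
  const exists = await this.repo.find(job.id)
  if (exists) return

  const result = await this.paymentService.charge(job.data)

  // Record the outcome only after the charge call has returned.
  await this.repo.save({
    jobId: job.id,
    status: 'completed',
    result
  })
}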

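Retries now skip any job whose result is already recorded. A crash between the charge and the save can still cause a second charge, so the payment call itself should also be idempotent.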
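One caveat with keying on `job.id`: a retry of the same job keeps its ID, but if the producer enqueues the same logical operation twice, the two jobs get different IDs and both pass the check. A possible refinement is to derive the key from business data instead. The payload fields and the `operationKey` helper below are assumptions for illustration.

```typescript
// Hypothetical payload shape for the charge job; field names are assumptions.
interface ChargeJobData {
  tenantId: string
  orderId: string
  amountCents: number
}

// Derives the key from what the job means, not from the queue entry that carries it.
function operationKey(data: ChargeJobData): string {
  return ['charge', data.tenantId, data.orderId].join(':')
}

// Two queue entries for the same order, as if the producer enqueued twice.
const firstEnqueue = { id: '101', data: { tenantId: 'tenant-a', orderId: 'order-42', amountCents: 500 } }
const secondEnqueue = { id: '102', data: { tenantId: 'tenant-a', orderId: 'order-42', amountCents: 500 } }

const firstKey = operationKey(firstEnqueue.data)
const secondKey = operationKey(secondEnqueue.data)
// Different queue IDs, same operation key: 'charge:tenant-a:order-42'
```

In the handler, the lookup would then use `operationKey(job.data)` rather than `job.id`, so both queue entries resolve to the same stored result.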

Summary

  • Retries repeat operations; they don’t fix failures

  • Most systems cannot safely handle duplicate execution

  • Idempotency is required for reliable background jobs

  • Context loss in retries can lead to data leaks

  • Reliability comes from design, not retries

Closing

If retries can break your system, your system was already broken.

In multi-tenant systems, they don’t just break things.

They leak them.

Final line

Failures are visible.
Retries make them invisible.

That’s why they are dangerous.
