Your Background Jobs Are Not Failing. They Are Lying.

You won’t see errors. You won’t get alerts. But your system is already losing data.

A few months ago, we had a system that was “working fine”.

Jobs were getting processed. Workers were running. No errors in logs.

But users started complaining.

Some notifications were never sent.

Not delayed. Not failed.

Just… missing.

That’s when we realized something:

Our background jobs weren’t failing.

They were lying.

The actual failure

The bug looked like this:

  • Job picked by worker

  • External API call started

  • Worker crashed (OOM / deploy / network)

  • Job had already been marked “completed” before the call finished

From the system’s perspective:

everything succeeded

From reality:

nothing happened

No retry.
No error.
No signal.

Just lost work.

Why this happens

Most queue systems don’t guarantee execution.

They guarantee that a job is delivered to a worker, not that the work completes.

That’s a big difference.

If your logic looks like:

  • pick job

  • do work

  • mark done

You are already vulnerable.

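Because crashes don’t follow your control flow. A worker can die between any two of those steps, and the moment “done” gets recorded decides whether the job is lost or repeated.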
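
To make that concrete, here is a minimal simulation of acknowledgement timing. It is a sketch, not how any particular queue library works: `TinyQueue`, `recoverCrashedWorker`, and `crashDuringWork` are hypothetical names invented for this illustration, and the "crash" is simply the point where the function stops before doing the work.

```typescript
// Hypothetical in-memory queue, used only to illustrate acknowledgement timing.
type QueuedJob = { id: string; payload: string };

class TinyQueue {
  private pending: QueuedJob[] = [];
  private inFlight = new Map<string, QueuedJob>();

  add(job: QueuedJob): void {
    this.pending.push(job);
  }

  // Hand the next job to a worker; it stays "in flight" until acknowledged.
  take(): QueuedJob | undefined {
    const job = this.pending.shift();
    if (job) this.inFlight.set(job.id, job);
    return job;
  }

  // Mark a job as done; it will never be delivered again.
  ack(id: string): void {
    this.inFlight.delete(id);
  }

  // Simulated crash recovery: unacknowledged jobs go back to the queue.
  recoverCrashedWorker(): void {
    this.inFlight.forEach((job) => {
      this.pending.push(job);
    });
    this.inFlight.clear();
  }

  size(): number {
    return this.pending.length;
  }
}

// Takes one job and "crashes" before the side effect happens.
// Returns how many jobs are available for retry after recovery.
function crashDuringWork(ackEarly: boolean): number {
  const queue = new TinyQueue();
  queue.add({ id: "job-1", payload: "welcome email" });

  const job = queue.take()!;
  if (ackEarly) queue.ack(job.id); // marked done before any work happened
  // ... the worker dies here, so the email is never sent ...
  queue.recoverCrashedWorker();
  return queue.size();
}

const lostWhenAckedEarly = crashDuringWork(true); // 0: nothing left to retry
const retriedWhenAckedLate = crashDuringWork(false); // 1: the job is redelivered
```

With the early acknowledgement the queue is empty after the crash, which is exactly the "missing, not failed" case. With the late one the job comes back, which means it may run more than once.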

The mental model shift

Stop thinking:

“Did the job run?”

Start thinking:

“Did the side effect happen?”

That’s the only thing that matters.

Concrete example (The Pain)

Let’s say:

  • You send an email

  • Or trigger a payment

  • Or update a ledger

If your job runs twice → problem
If your job runs zero times → bigger problem

And both happen more often than people think.

The real fixes (not a checklist, but logic)

1. Move “done” after verification

Don’t mark jobs complete when work starts.
Mark them complete when the outcome is confirmed.

2. Idempotency is not optional

If retry breaks your system, your system is already broken.

Example:

  • Store operation keys

  • Deduplicate by business ID

  • Make side effects safe to repeat
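
Those three points can be sketched as a small operation log. This is only an illustration under stated assumptions: `OperationLog`, `runOnce`, and `sendReceipt` are hypothetical names, and the in-memory `Map` stands in for what would more likely be a database table with a unique constraint on the key.

```typescript
// Hypothetical in-memory store; a real system would more likely use a
// database table with a unique constraint on the operation key.
class OperationLog {
  private results = new Map<string, string>();

  // Runs `effect` at most once per key and returns the recorded result
  // on every later call with the same key.
  runOnce(key: string, effect: () => string): string {
    const previous = this.results.get(key);
    if (previous !== undefined) return previous;

    const result = effect();
    this.results.set(key, result);
    return result;
  }
}

const operationLog = new OperationLog();
const sent: string[] = [];

// The key is derived from business data (notification type + order ID),
// not from the queue's job ID, so a re-enqueued duplicate is caught too.
function sendReceipt(orderId: string): string {
  return operationLog.runOnce(`receipt:${orderId}`, () => {
    sent.push(orderId); // stands in for the real email call
    return `receipt sent for ${orderId}`;
  });
}

sendReceipt("order-42");
sendReceipt("order-42"); // retry: no second email
sendReceipt("order-43");
// `sent` now holds "order-42" and "order-43", one entry each.
```

One caveat: a crash between the effect and the write to the log would still repeat the effect on retry. Where the downstream API accepts an idempotency key of its own, passing the same key through closes that gap.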

3. Separate “processing” from “commit”

This is where most people mess up.

Instead of:

  • process → done

Do:

  • process → verify → commit
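
Here is one way the process → verify → commit split might look. It is a sketch, not a definitive implementation: `FakeEmailProvider`, `wasDelivered`, and `attempt` are hypothetical, and the provider stub deliberately swallows its first send to imitate a call that returns without doing anything.

```typescript
type Status = "pending" | "processing" | "committed";

interface JobRecord {
  id: string;
  status: Status;
  attempts: number;
}

// Hypothetical provider stub: accepts a message and can report whether it
// was really delivered. `dropFirst` simulates a send that silently fails.
class FakeEmailProvider {
  private delivered = new Set<string>();
  private dropFirst: boolean;

  constructor(dropFirst: boolean) {
    this.dropFirst = dropFirst;
  }

  send(messageId: string): void {
    if (this.dropFirst) {
      this.dropFirst = false; // swallow the first send
      return;
    }
    this.delivered.add(messageId);
  }

  wasDelivered(messageId: string): boolean {
    return this.delivered.has(messageId);
  }
}

// One attempt: process, verify, and only then commit.
function attempt(job: JobRecord, provider: FakeEmailProvider): void {
  if (job.status === "committed") return; // confirmed by an earlier attempt

  job.status = "processing";
  job.attempts += 1;
  provider.send(job.id); // process

  if (provider.wasDelivered(job.id)) { // verify
    job.status = "committed"; // commit
  }
  // Otherwise the job stays "processing" and is eligible for another attempt.
}

const provider = new FakeEmailProvider(true);
const notification: JobRecord = { id: "notif-7", status: "pending", attempts: 0 };

attempt(notification, provider); // the send is swallowed, nothing is committed
const statusAfterFirst: Status = notification.status;
attempt(notification, provider); // the second attempt is verified and committed
const statusAfterSecond: Status = notification.status;
```

The point of the extra state is that "processing" is visible. A job stuck there is a signal you can alert on, instead of a "completed" row that never matched reality.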

4. Accept that jobs will run multiple times

Design for it.

Don’t try to prevent it.
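
Designing for it often means choosing writes that give the same result when repeated. A minimal sketch, assuming a ledger keyed by a business ID (the `Ledger` class and the `payment-1001` ID are hypothetical, made up for this example):

```typescript
// Hypothetical ledger: entries are keyed by a business ID, and the balance
// is derived from the entries instead of being incremented in place.
class Ledger {
  private entries = new Map<string, number>();

  // Applying the same entry twice overwrites it with the same value.
  apply(entryId: string, amountCents: number): void {
    this.entries.set(entryId, amountCents);
  }

  balance(): number {
    let total = 0;
    this.entries.forEach((amount) => {
      total += amount;
    });
    return total;
  }
}

const ledger = new Ledger();
let naiveBalance = 0;

// The same job is delivered twice, as a retry after a crash might do.
for (let delivery = 0; delivery < 2; delivery++) {
  ledger.apply("payment-1001", 2500); // keyed write: safe to repeat
  naiveBalance += 2500; // in-place increment: double counts
}

const ledgerBalance = ledger.balance(); // 2500
// naiveBalance is 5000 at this point
```

The keyed write does not care how many times the job ran. The increment does.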

NestJS angle

If you’re using NestJS + Bull:

The default pattern most people write is this:

@Process()
async handle(job: Job) {
  await this.sendEmail(job.data)
}

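This is unsafe: nothing records whether the email was actually sent, so a job that gets picked up again sends it a second time.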

A better version:

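@Process()
async handle(job: Job) {
  // Skip work that an earlier attempt already completed and recorded.
  const exists = await this.repo.find(job.id)
  if (exists) return

  // Perform the side effect; if this throws, no "completed" record is written.
  const result = await this.sendEmail(job.data)

  // Record completion only after the side effect has returned.
  await this.repo.save({
    jobId: job.id,
    status: 'completed',
    result
  })
}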

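Now:

  • a retry after a recorded success does nothing

  • a crash before the save leaves no “completed” record, so the job can run again instead of vanishing

One gap remains: a crash between sendEmail and repo.save still sends the email twice on retry, so the send itself has to be safe to repeat.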

The uncomfortable truth

If your background job system looks “clean” and “simple”

You probably haven’t seen it fail yet.

Because when it does, it won’t throw errors.

It will just silently stop doing what you think it’s doing.

Closing line

Failures in APIs are loud.
Failures in background jobs are silent.

That’s why they are dangerous.