Debugging Laravel Queue Backlogs in Production Without Losing Jobs

When Laravel jobs start piling up in production, the fastest fix is rarely the safest one. This guide shows how to tell the difference between stuck workers and genuinely slow jobs, how to inspect the backlog without guessing, and how to restore throughput without dropping or duplicating work.

A queue backlog usually shows up at the worst possible time. Emails stop going out, imports stall halfway through, web requests start timing out because someone moved work “into the queue,” and the first instinct is to restart everything and add more workers.

Sometimes that works. Sometimes it makes things worse.

In production, queue backlogs are rarely just “the queue is slow.” More often, one of three things is happening: workers are dead or blocked, jobs are taking much longer than expected, or a subset of jobs is failing and being retried in a way that starves everything else. If you treat all three the same way, you risk duplicate processing, lost work, or a backlog that comes right back ten minutes later.

This article walks through a safe way to debug Laravel queue backlogs in production: identify what is actually piling up, separate stuck workers from slow jobs, and recover throughput without dropping jobs under pressure.

Start with the right question

When a backlog grows, the real question is not “how do I clear the queue?” It is:

Why is work entering the queue faster than it is completing?

That framing matters because the fix depends on which side is failing:

  • Jobs are not being consumed at all
  • Jobs are being consumed, but too slowly
  • Jobs are failing and coming back repeatedly
  • One queue is starving another
  • An external dependency is slowing everything down

If you skip that distinction, you end up doing production guesswork: restarting workers, scaling horizontally, or deleting jobs before you understand the failure mode.

What a queue backlog usually means

In Laravel, a backlog is usually visible as one or more of these symptoms:

  • jobs table growing rapidly for the database driver
  • Redis queue length increasing and not coming down
  • Horizon showing pending jobs climbing while throughput flattens
  • Old jobs remaining unprocessed for much longer than normal
  • Users reporting delayed side effects like emails, webhooks, exports, or billing actions

The underlying causes tend to cluster into a few categories.

1. Workers are not actually processing jobs

This is the cleanest failure mode. Workers may have crashed, been killed during deployment and not restarted, become stuck on a blocking call, or be running against the wrong queue.

In this case, pending jobs rise and completed jobs fall to nearly zero.

2. Jobs are processing, but each job got slower

This is more subtle. Workers are alive, but something changed: a third-party API got slower, an N+1 issue slipped into a job, a report now processes 10x more data, or a lock causes jobs to wait on each other.

Here, throughput drops even though workers appear healthy.

3. Jobs are retrying aggressively

A job that times out or throws an exception may be released back onto the queue over and over. If enough of those accumulate, they can consume worker time without making progress.

This is especially painful when retries share the same queue as business-critical jobs.

Diagnose with data, not restarts

Before changing worker counts or flushing anything, collect a few facts.

Step 1: Check whether workers are alive

If you use Horizon, start there. Look at:

  • Pending jobs by queue
  • Active processes
  • Throughput over time
  • Recent failed jobs
  • Runtime distribution if available

If you are not using Horizon, inspect your worker processes directly:

ps aux | grep "queue:work"
supervisorctl status
systemctl status laravel-worker

Then verify workers are listening to the queue you think they are:

php artisan queue:work --queue=default,emails

A common production mistake is having jobs dispatched to emails while workers only listen to default.
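The mismatch is easy to miss in code review because both sides look correct in isolation. A sketch of what it tends to look like (SendWelcomeEmail is a hypothetical job class):

```php
// Dispatch side: the job goes onto the "emails" queue.
SendWelcomeEmail::dispatch($user)->onQueue('emails');

// Worker side: this worker drains "default" only, so the job above
// sits pending forever. The worker must name every queue it serves:
//   php artisan queue:work --queue=default,emails
```

Queues listed in --queue are drained in order, so putting emails first also gives it priority over default.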

Step 2: Measure backlog age, not just backlog size

A queue length of 5,000 may be fine if jobs are lightweight and throughput is high. A queue length of 200 may be a production incident if the oldest job is now 45 minutes old.

What matters is whether the system is catching up.
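"Catching up" can be made concrete with back-of-envelope arithmetic: compare the completion rate to the arrival rate and estimate drain time. This is a standalone sketch, not a Laravel API:

```php
<?php

// Rough drain-time estimate: minutes until the backlog clears,
// or null if the queue is not catching up at all.
function estimateDrainMinutes(int $backlog, float $arrivalsPerMin, float $completionsPerMin): ?float
{
    $netDrainPerMin = $completionsPerMin - $arrivalsPerMin;

    if ($netDrainPerMin <= 0) {
        return null; // flat or growing; raw queue length hides this
    }

    return $backlog / $netDrainPerMin;
}

// 5,000 pending jobs, 400 arriving and 600 completing per minute:
// the backlog drains in roughly 25 minutes.
```

If the estimate is null or measured in hours, you have an incident regardless of the absolute queue length.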

If you use Redis, inspect queue depth and oldest timestamps through Horizon or your own metrics. For database queues, check when jobs were created:

SELECT queue, COUNT(*) AS total, MIN(created_at) AS oldest_job
FROM jobs
GROUP BY queue
ORDER BY total DESC;

That tells you which queue is in trouble and how long work has been waiting.
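For the Redis driver, the equivalent quick check is per-queue depth by key. Laravel's Redis queue stores pending jobs in lists named `queues:<name>` by default; verify the prefix against your own Redis configuration before trusting the numbers:

```shell
# Pending jobs per queue (default Laravel key naming)
redis-cli llen queues:default
redis-cli llen queues:emails

# Delayed and reserved jobs live in sorted sets next to the list
redis-cli zcard queues:default:delayed
redis-cli zcard queues:default:reserved
```

A large reserved set with a flat pending list can mean workers picked up jobs and then stalled, which points toward stuck workers rather than arrival pressure.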

Step 3: Separate stuck workers from slow jobs

This is the critical split.

If workers are stuck, you typically see:

  • Very low completion rate
  • Few or no log entries from job handle() methods
  • Long-running worker processes with no job turnover
  • Horizon active workers that do not complete jobs

If jobs are merely slow, you see:

  • Workers stay busy
  • Some jobs complete, just more slowly than before
  • Runtime increases cluster around specific job types
  • CPU, database, or external API latency increases during processing

Add timing around important jobs if you do not already have it:

public function handle(): void
{
    $startedAt = microtime(true);

    try {
        // Job logic...
    } finally {
        // Runs even when the job throws, so slow failures are timed too.
        logger()->info('job.completed', [
            'job' => static::class,
            'duration_ms' => (int) ((microtime(true) - $startedAt) * 1000),
            'attempts' => $this->attempts(),
        ]);
    }
}

This is simple, but in production it quickly shows whether one job class suddenly went from 300ms to 25s.

Step 4: Inspect failures and retries

Check failed jobs, but also check repeated attempts on jobs that have not failed permanently yet.

php artisan queue:failed

Look for patterns:

  • Timeouts
  • Deadlocks
  • API rate limits
  • Network errors
  • Serialization issues
  • ModelNotFoundException after the underlying model was deleted

Also review tries, backoff, and timeout on the job class. A bad combination can quietly amplify incidents.

class SyncWebhook implements ShouldQueue
{
    public $tries = 5;
    public $timeout = 30;

    public function backoff(): array
    {
        return [10, 30, 60, 300];
    }
}

If retries happen immediately with no useful backoff, a struggling dependency can flood your workers with the same failing work.

Step 5: Check for queue starvation

Not all backlog incidents are global. Sometimes one heavy queue blocks everything else because all workers are shared.

For example:

  • PDF generation jobs take 90 seconds each
  • Email jobs take 300ms each
  • Both run on default
  • A bulk export floods the queue
  • Password reset emails now wait 20 minutes

That is not just a performance issue. It is a queue design issue.

Safe ways to recover throughput

Once you know the bottleneck, fix the specific cause instead of reaching for the most dramatic action.

1. If workers are dead or wedged, restart them gracefully

Laravel workers are long-lived processes. They can hold stale state, leak memory, or get stuck in bad external calls.

Use a graceful restart first:

php artisan queue:restart

That tells workers to finish their current jobs and then restart cleanly. If you use Supervisor, make sure it actually respawns them:

[program:laravel-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/app/artisan queue:work redis --sleep=3 --tries=3 --timeout=90
numprocs=4
autostart=true
autorestart=true
stopasgroup=true
killasgroup=true
stdout_logfile=/var/www/app/storage/logs/worker.log

Trade-off: a graceful restart is safe, but if jobs are hard-blocked on a network call longer than your timeout, recovery may still be slow. Avoid killing workers blindly unless you understand what the running jobs do and whether they are idempotent.

2. If jobs are slow, isolate the expensive work

When one class of jobs dominates worker time, move it to its own queue.

ProcessLargeExport::dispatch($export)->onQueue('exports');
SendOrderEmail::dispatch($order)->onQueue('emails');

Then run dedicated workers:

php artisan queue:work redis --queue=emails --sleep=1 --tries=3
php artisan queue:work redis --queue=exports --sleep=1 --tries=1 --timeout=300

This protects latency-sensitive jobs from heavy batch work.

Trade-off: more queues improve isolation, but they add operational complexity. Only split queues where latency or resource profiles are meaningfully different.

3. If external dependencies are slow, reduce concurrency pressure

Sometimes the backlog is caused by pushing too hard on a dependency that is already failing. Adding workers can increase retries, rate limits, and timeout storms.

In that case, lower concurrency for the affected queue and increase backoff. You want controlled recovery, not a larger failure blast radius.

This is especially true for jobs that call payment gateways, CRMs, or webhook endpoints.
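One Laravel-native way to implement this is exception-throttling job middleware, which pauses a job type after repeated failures instead of letting retries hammer the dependency. A sketch on a hypothetical job class; check the constructor arguments against the docs for your Laravel version, since older releases take minutes rather than seconds:

```php
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\Middleware\ThrottlesExceptions;

class SyncCrmContact implements ShouldQueue
{
    public function middleware(): array
    {
        // After 10 exceptions, pause this job type for 5 minutes
        // instead of cycling the same failing API call.
        return [new ThrottlesExceptions(10, 5 * 60)];
    }
}
```

Combined with a dedicated queue and a low worker count, this turns a flapping dependency into a slow, controlled drain instead of a retry storm.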

4. If retries are consuming capacity, fix retry behavior before scaling

Scaling workers on top of bad retry settings often just burns more CPU to fail faster.

Use:

  • Sensible tries
  • Exponential backoff
  • Clear timeout values
  • Separate failed-dependency queues when needed

If a dependency is down, it is often better for jobs to back off for several minutes than to cycle every few seconds.
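Laravel also supports a hard retry deadline instead of a fixed attempt count, which fits this "back off for minutes" pattern well. A sketch on a hypothetical job:

```php
use Illuminate\Contracts\Queue\ShouldQueue;

class PushAnalyticsEvent implements ShouldQueue
{
    // Widening gaps between attempts, in seconds.
    public function backoff(): array
    {
        return [30, 120, 600];
    }

    // Stop retrying entirely 30 minutes after the job was first queued,
    // no matter how many attempts that works out to.
    public function retryUntil(): \DateTime
    {
        return now()->addMinutes(30);
    }
}
```

A deadline like this caps the total capacity one stubborn job can consume, which a bare $tries value does not.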

5. If jobs may be re-run, make them idempotent

This is what makes recovery safe.

If you restart workers, release timed-out jobs, or retry failures, some jobs may execute more than once. A job that charges a card, sends a webhook, or creates a record must tolerate retries.

Examples of idempotency safeguards:

  • Store an external idempotency key before calling a payment API
  • Use unique database constraints for “create once” operations
  • Check whether a side effect already happened before repeating it
  • Record processed event IDs
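As a sketch of the first item, an idempotency key guard using Laravel's cache. Cache::add only writes when the key is absent (atomic on stores like Redis; check your driver). ChargeCustomer and the key format are hypothetical:

```php
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\Cache;

class ChargeCustomer implements ShouldQueue
{
    public function handle(): void
    {
        // Hypothetical key: one charge per order, ever.
        $key = "charge:{$this->order->id}";

        // A retried job gets `false` here and skips the side effect.
        if (! Cache::add($key, true, now()->addDay())) {
            return;
        }

        $this->gateway->charge($this->order);
    }
}
```

For money movement, a durable record (a unique database constraint or the payment provider's own idempotency key) is stronger than a cache entry, which can expire or be flushed.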

A queue system is far easier to operate when retrying work is safe.

Common mistakes that make incidents worse

Flushing the queue to “restore service”

Deleting queued jobs may reduce dashboard anxiety, but it also discards real business work. Unless the jobs are proven safe to regenerate, queue:clear is usually an incident multiplier, not a fix.

Increasing worker count without checking the bottleneck

If the database is saturated or a third-party API is timing out, more workers can make the whole system less stable.

Running everything on one queue

This works until it does not. Mixing user-facing notifications with bulk processing guarantees that one day a low-priority spike will delay high-priority work.

Using timeouts that do not match reality

If a job normally needs 120 seconds but worker timeout is 60, you create duplicates and retries under load. If timeout is too high, stuck jobs occupy workers too long. Set it based on measured runtime, not guesses.
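A related setting worth checking: for the Redis and database drivers, the connection's retry_after in config/queue.php must exceed the longest worker --timeout (and any per-job $timeout), or a still-running job gets handed to a second worker. A fragment showing the relationship, with values matching the 90-second worker timeout above:

```php
// config/queue.php (fragment)
'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('REDIS_QUEUE', 'default'),
    // Must be longer than the worker --timeout (90s here), otherwise
    // the job is released and run again while still executing.
    'retry_after' => 150,
],
```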

Treating failed jobs and slow jobs as the same issue

A failing job needs debugging. A slow job needs profiling. A stuck worker needs recovery. These are different problems.

A real production pattern to watch for

One of the more common incidents looks like this:

A team adds a new queued job to sync customer data to a third-party API. It works in testing, so it goes onto the default queue with emails and notifications. A few days later, the third-party starts responding in 20 to 30 seconds instead of 500ms.

Nothing is technically broken. Workers stay up. Jobs still complete. But each worker now handles far fewer jobs per minute, the default queue backs up, and users stop getting login emails on time.

The wrong response is to immediately double worker count. That may help for a moment, but if the API starts rate-limiting, retries pile up too.

The better response is:

  1. Confirm that throughput dropped specifically on the sync job
  2. Move sync jobs to a dedicated queue
  3. Cap worker concurrency for that queue
  4. Increase backoff for transient failures
  5. Keep fast notification jobs isolated on their own workers

That restores critical user-facing throughput without losing sync jobs.
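With Horizon, steps 2 and 3 map to separate supervisor groups in config/horizon.php. A sketch; the supervisor names, queue names, and process counts are illustrative:

```php
// config/horizon.php (fragment)
'environments' => [
    'production' => [
        'notifications' => [
            'connection'   => 'redis',
            'queue'        => ['emails', 'default'],
            'maxProcesses' => 8,  // keep fast, user-facing jobs flowing
        ],
        'crm-sync' => [
            'connection'   => 'redis',
            'queue'        => ['crm-sync'],
            'maxProcesses' => 2,  // cap pressure on the slow third-party API
            'tries'        => 3,
        ],
    ],
],
```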

Build the observability before the incident

The easiest backlog to debug is the one you already instrumented.

At minimum, capture:

  • Queue depth by queue name
  • Oldest pending job age
  • Job runtime by class
  • Failure count by class and exception type
  • Retry count and timeout count
  • Worker process count

If you use Horizon, you already get a strong baseline. If not, even simple logs and a few application metrics can turn a guessing session into a quick diagnosis.

Conclusion

A production queue backlog is not a single problem with a single fix. The safe path is to identify whether workers are stuck, jobs are slow, or retries are consuming capacity, and then respond to that specific failure mode.

The main goal is not just to make the queue number smaller. It is to restore throughput without losing work or creating duplicates.

In practice, that means measuring backlog age, checking worker health, isolating heavy jobs, tuning retries carefully, and designing jobs so they can be safely retried. If you approach queue incidents that way, you can usually recover calmly, even when production is backing up fast.