Why Laravel Jobs Fail Silently in Production and How to Catch Them Early
Introduction
One of the easiest mistakes to make with Laravel queues is trusting dispatch() too much.
You queue a job to send an invoice, sync a CRM record, generate a report, or resize an uploaded image. The controller returns 200. The UI shows success. Everyone assumes the work is done.
Then production says otherwise.
A customer never gets their receipt. A webhook is never sent. A subscription state stays stale for hours. When you check the code path, the job was definitely dispatched. But somewhere between “accepted by the queue” and “completed successfully,” the work disappeared.
This is a real production problem because queue systems create a false sense of completion. Dispatching a job only means Laravel handed work to a broker or storage backend. It does not mean a worker picked it up, finished it, and committed the result.
If you do not explicitly monitor that gap, jobs can fail silently for a long time.
The Real Problem: Dispatched Is Not the Same as Done
In production, there are several points where a job can be lost, retried, delayed, or effectively hidden.
A job may throw an exception and keep retrying until it reaches its max attempts. It may exceed a timeout and get killed by the worker. A worker process may crash after reserving the job but before finishing it. An external API may return intermittent failures that never get surfaced clearly. In some teams, the only signal is a growing failed_jobs table that nobody checks.
The dangerous part is that each layer can make the system look healthy.
Your HTTP request succeeds because dispatching worked. Redis is up, so queue writes succeed. Horizon or Supervisor shows workers running. Meanwhile, customers are still missing important actions because the failure mode is in execution, not dispatch.
Why this happens in Laravel applications
Laravel makes queues easy to adopt, which is good, but it also means teams often stop at the happy path.
They dispatch jobs but do not define retry behavior intentionally. They let exceptions bubble in some jobs and swallow them in others. They rely on the default failed job behavior but never alert on it. They log errors inside jobs but do not connect those logs to business impact.
A common pattern looks like this:
```php
public function handle(): void
{
    try {
        $this->crmService->syncCustomer($this->customerId);
    } catch (\Throwable $e) {
        Log::error('CRM sync failed', [
            'customer_id' => $this->customerId,
            'error' => $e->getMessage(),
        ]);
    }
}
```
This looks responsible, but it is often the start of silent failure. The exception is logged, but the job is treated as successful because nothing is re-thrown. Laravel will not retry it. It will not land in failed_jobs. Your monitoring may never notice.
From the queue system’s perspective, the job completed.
From the customer’s perspective, it failed.
How to Diagnose Silent Queue Failures Without Guessing
When jobs go missing in production, guessing is expensive. Start by proving where work stops.
1. Trace a single business action end-to-end
Pick one concrete failure, not a general symptom.
For example: “Order #18421 was paid, but the confirmation email and warehouse sync never happened.” Then trace that action through each step:
- Was the job dispatched?
- Was it written to the queue backend?
- Was it picked up by a worker?
- Did it start running?
- Did it complete?
- Did the expected side effect happen?
This is why job-level correlation data matters. Include identifiers like order ID, customer ID, tenant ID, and an operation UUID in every job log context.
```php
public function handle(): void
{
    Log::withContext([
        'job' => static::class,
        'order_id' => $this->orderId,
    ]);

    // job logic
}
```
Without this, queue debugging turns into reading unrelated logs and guessing which failure belongs to which customer action.
2. Check retries and attempts before reading application code
A large number of queue issues are operational, not logical.
Inspect the job’s retry settings:
- tries
- backoff
- timeout
- worker --tries
- worker --timeout
- queue connection retry_after
Misalignment here causes confusing behavior. For example, if timeout is longer than retry_after, a worker can still be processing a job when the queue makes it available again. That can produce duplicate execution, overlapping side effects, or jobs that appear inconsistent.
A practical example:
```php
class SyncInvoiceToXero implements ShouldQueue
{
    use Queueable;

    public int $tries = 5;
    public int $timeout = 30;

    public function __construct(public int $invoiceId) {}

    public function backoff(): array
    {
        return [60, 300, 900];
    }

    public function handle(XeroClient $client): void
    {
        $client->pushInvoice($this->invoiceId);
    }
}
```
If the external API regularly takes 45 seconds but your timeout is 30, jobs will keep dying mid-flight. If workers restart often, this can look random until you compare execution time against timeout settings.
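The retry_after side of that alignment lives in the queue connection configuration, not on the job class. A minimal sketch, assuming the Redis queue driver; the specific numbers are illustrative, not recommendations. The rule of thumb is that retry_after should stay comfortably longer than the longest worker --timeout, so a job is never re-released while it is still running:

```php
// config/queue.php (Redis connection sketch)
'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('REDIS_QUEUE', 'default'),

    // Seconds before an unfinished job is released back to the queue.
    // Keep this larger than any worker --timeout (e.g. 120s) so a
    // still-running job is never handed to a second worker.
    'retry_after' => 130,

    'block_for' => null,
],
```

With this configuration, workers started with `php artisan queue:work --timeout=120` will always be killed before the queue considers the job abandoned, which removes one common source of duplicate execution.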
3. Inspect failed_jobs, but do not stop there
Yes, check failed_jobs. But do not assume an empty table means the queue is healthy.
Jobs can fail without landing there if:
- exceptions are swallowed
- workers are killed at the process level
- deployment restarts interrupt long-running jobs
- infrastructure issues prevent normal failure handling
- jobs are manually deleted or acknowledged too early in custom flows
The failed jobs table is useful, but it only covers one category of failure.
4. Listen to queue events
Laravel gives you good hooks here, and many teams do not use them enough.
You can listen for JobFailed, JobProcessed, and JobProcessing to create visibility around execution.
```php
use Illuminate\Queue\Events\JobFailed;
use Illuminate\Queue\Events\JobProcessed;
use Illuminate\Support\Facades\Event;
use Illuminate\Support\Facades\Log;

public function boot(): void
{
    Event::listen(JobProcessed::class, function (JobProcessed $event) {
        Log::info('Job processed', [
            'name' => $event->job->resolveName(),
            'queue' => $event->job->getQueue(),
        ]);
    });

    Event::listen(JobFailed::class, function (JobFailed $event) {
        Log::error('Job failed', [
            'name' => $event->job->resolveName(),
            'queue' => $event->job->getQueue(),
            'exception' => $event->exception->getMessage(),
        ]);
    });
}
```
For production systems, these events are more useful when sent to metrics and alerting instead of only logs.
5. Measure outcomes, not just infrastructure health
This is the part that catches issues before customers do.
Do not only monitor whether workers are running. Monitor whether business-critical jobs are completing within expected time windows.
Examples:
- percentage of SendInvoiceEmail jobs completed within 2 minutes
- number of SyncSubscriptionToBilling jobs failed in 15 minutes
- age of oldest pending GenerateReport job
- count of jobs stuck in retry state by job class
A running worker is not proof of a working system. Completion metrics are.
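One cheap way to get a completion-style signal is to watch the age of the oldest pending job. A minimal sketch, assuming the database queue driver (its default jobs table stores created_at as a unix timestamp); the 120-second threshold is an illustrative value you would tune per queue:

```php
// Could run from a scheduled command; assumes the database queue driver.
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

// Oldest job that no worker has reserved yet.
$oldest = DB::table('jobs')
    ->whereNull('reserved_at')
    ->min('created_at');

$ageSeconds = $oldest ? now()->getTimestamp() - $oldest : 0;

if ($ageSeconds > 120) {
    // Replace with your alerting channel of choice.
    Log::warning('Queue backlog detected', [
        'oldest_pending_seconds' => $ageSeconds,
    ]);
}
```

The same idea generalizes: a backlog that keeps growing means workers are not keeping up, regardless of how healthy the worker processes themselves look.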
Practical Fixes That Actually Improve Visibility
1. Never swallow exceptions unless you replace them with explicit failure handling
If a job truly failed, let it fail.
```php
public function handle(): void
{
    $this->crmService->syncCustomer($this->customerId);
}
```
If you need to add context, log and rethrow:
```php
public function handle(): void
{
    try {
        $this->crmService->syncCustomer($this->customerId);
    } catch (\Throwable $e) {
        Log::error('CRM sync failed', [
            'customer_id' => $this->customerId,
            'exception' => $e,
        ]);

        throw $e;
    }
}
```
Trade-off: this increases visible failures, which can feel noisy at first. That is a good problem. Noise can be tuned. Invisible failure cannot.
2. Use the failed() method for cleanup and escalation
Laravel lets each job define what should happen after final failure.
```php
public function failed(?\Throwable $exception): void
{
    Log::critical('Invoice sync permanently failed', [
        'invoice_id' => $this->invoiceId,
        'error' => $exception?->getMessage(),
    ]);

    // Notify internal team, mark status in database, create support task, etc.
}
```
This is where you should connect technical failure to business recovery. Update a status column. Trigger an internal alert. Make the failure visible in an admin panel.
3. Make retries intentional
Retries are not always helpful.
If a job fails because of a temporary API timeout, retrying makes sense. If it fails because the payload is invalid or a related model no longer exists, retries only delay detection and waste worker time.
Use job-specific retry rules. Consider backoff() arrays for transient failures and fail fast for permanent ones.
You can also classify exceptions:
```php
public function handle(PaymentGateway $gateway): void
{
    try {
        $gateway->capture($this->paymentId);
    } catch (TemporaryGatewayException $e) {
        throw $e; // transient: let the queue retry with backoff
    } catch (InvalidPaymentStateException $e) {
        $this->fail($e); // permanent: fail immediately, no pointless retries
    }
}
```
That prevents five pointless retries on data that will never succeed.
4. Add idempotency for jobs that may run more than once
Silent failure is often discovered next to duplicate execution.
If a worker dies after partially completing a job, the job may run again. That is normal in distributed systems. Critical jobs should be safe to re-run.
Examples:
- store external sync state with unique keys
- use database constraints to prevent duplicate records
- check whether an email was already sent before sending again
- record third-party request IDs
Visibility and idempotency work together. Once you start surfacing retries and timeouts correctly, you also need jobs that can handle replay safely.
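A sketch of the "check before sending" idea, assuming a sent_emails table with a unique index on (order_id, type); the table, column names, and OrderConfirmation mailable are hypothetical, invented for illustration:

```php
// Inside a queued job's handle() method. insertOrIgnore() returns the
// number of rows inserted; 0 means a previous attempt already claimed
// this key, so the side effect is skipped on replay.
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Mail;

public function handle(): void
{
    $claimed = DB::table('sent_emails')->insertOrIgnore([
        'order_id'   => $this->orderId,
        'type'       => 'confirmation',
        'created_at' => now(),
    ]);

    if ($claimed === 0) {
        return; // already sent on an earlier attempt
    }

    Mail::to($this->customerEmail)->send(new OrderConfirmation($this->orderId));
}
```

The unique index does the real work here: even two workers racing on the same job cannot both insert the row, so the email goes out at most once.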
5. Use Horizon or metrics tooling as an operational dashboard, not decoration
If you use Redis queues, Horizon is worth treating as production infrastructure, not just a nice UI.
Watch:
- failed jobs by type
- throughput changes by queue
- runtime spikes
- long wait times
- retry trends after deployments
If you are not using Horizon, push queue metrics into whatever you already trust: Prometheus, Datadog, New Relic, Sentry, or even structured logs plus alerting.
The key is this: create alerts on symptoms that matter. “Worker process is down” is useful. “Order fulfillment jobs have a 20% failure rate in the last 10 minutes” is much better.
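A rough failure-rate alert can be built from the JobFailed event alone, using the cache as a rolling counter. A minimal sketch; the 15-minute window and threshold of 20 are illustrative, and the Log::critical call stands in for whatever alerting channel you actually use:

```php
// In a service provider's boot() method.
use Illuminate\Queue\Events\JobFailed;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Event;
use Illuminate\Support\Facades\Log;

Event::listen(JobFailed::class, function (JobFailed $event) {
    $name = $event->job->resolveName();
    $key = 'job-failures:' . $name;

    // Cache::add() only sets the key if it is absent, which starts
    // the 15-minute window; increment() then counts within it.
    Cache::add($key, 0, now()->addMinutes(15));
    $count = Cache::increment($key);

    if ($count >= 20) {
        // Swap for Slack, PagerDuty, etc. in a real system.
        Log::critical('High queue failure rate', [
            'job' => $name,
            'failures_in_window' => $count,
        ]);
    }
});
```

This is deliberately simple: it alerts on the symptom the article cares about (a specific job class failing repeatedly) rather than on worker process health.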
Common Mistakes
Treating logs as monitoring
Writing an error log is not the same as surfacing a problem. If nobody is alerted and nothing is measured, it is just archived failure.
Using one retry policy for every job
A thumbnail generator, billing sync, and webhook dispatcher do not fail for the same reasons. Shared defaults are fine as a starting point, but production jobs need class-specific behavior.
Ignoring worker lifecycle during deploys
A lot of “random” failures come from restarts. Long-running jobs interrupted by deploys, autoscaling, or memory limits can disappear into retry churn unless your timeouts and worker shutdown strategy are aligned.
Only checking failed_jobs
Useful, but incomplete. Silent loss often lives outside that table.
A Real Production Pattern Worth Watching
A common failure pattern in smaller Laravel teams looks like this:
An order is placed, and three jobs are dispatched: send email, sync inventory, and push accounting data. Dispatch succeeds, so the request returns success immediately. Inventory sync starts failing because the vendor API rate-limits requests during peak traffic.
The job catches the exception, logs it, and exits. No retry happens because nothing is thrown. No failed job entry appears. The customer got the order confirmation, so support does not notice immediately. Two days later, stock counts are wrong and finance data is incomplete.
The fix is usually not one change. It is a set of small production-minded improvements:
- rethrow transient exceptions
- fail permanent validation issues explicitly
- track job completion by order ID
- alert on inventory sync failure rate
- expose unsynced orders in an internal dashboard
That combination turns a hidden queue problem into an operationally visible system.
Conclusion
Laravel queues rarely fail silently because Laravel is broken. They fail silently because teams stop observing the system at dispatch.
That is the mindset shift that matters: queued work is not complete when it is accepted. It is complete when the side effect you care about has happened, and you can prove it.
If you want to catch queue issues early, focus on execution visibility. Let real failures fail. Use failed() for recovery. Tune retries per job. Measure completion, not just worker uptime. And attach queue health to business outcomes, not only infrastructure dashboards.
Once you do that, production queue bugs stop being mysterious customer reports and start becoming detectable operational signals.