CRITICAL BUG: APScheduler Permanent Deadlock on MySQL Network Failures #1083

@akhilesh-chander

Things to check first

  • I have checked that my issue does not already have a solution in the FAQ

  • I have searched the existing issues and didn't find my bug already reported there

  • I have checked that my bug is still present in the latest release

Version

3.11.0

What happened?

Bug Summary
Severity: CRITICAL - Production Breaking
Component: SQLAlchemyJobStore with MySQL
Impact: Complete scheduler halt requiring manual intervention
Affected Versions: All versions using SQLAlchemyJobStore with MySQL
APScheduler enters a permanent deadlock state when MySQL network connectivity fails during job execution. The scheduler never recovers and stops processing all jobs indefinitely.
Environment

APScheduler Version: 3.11.0
Python Version: 3.11
Database: MySQL 8.0
JobStore: SQLAlchemyJobStore
Deployment: Kubernetes pods

Problem Description
When MySQL experiences a temporary network connectivity failure (for example, pings to the database host start failing), APScheduler fails to update the job's next_run_time in the database. The scheduler then reads the same stale job repeatedly, enters a deadlocked state, and never wakes up again.
Critical Sequence
⚠️ CRITICAL: MySQL must fail DURING active job execution for deadlock to occur

  1. Scheduler picks job with earliest next_run_time via get_next_run_time()
  2. Job starts executing (scheduler is now committed to this job)
  3. MySQL network fails DURING job execution (not before, not after)
  4. Job completes execution in memory
  5. Scheduler attempts update_job() to set new next_run_time
  6. update_job() fails silently due to MySQL downtime - no exception raised
  7. Job's next_run_time remains stale in database (shows old execution time)
  8. MySQL network recovers (but damage is already done)
  9. Next get_next_run_time() query returns same stale job timestamp
  10. Scheduler calculates final wakeup time based on stale data and never wakes up again
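
The sequence above can be mimicked with a toy model. This is a minimal sketch, not APScheduler code: `ToyJobStore`, its `online` flag, and `run_once` are hypothetical names, an in-memory SQLite table stands in for MySQL, and the caller swallows the error to mimic the silent failure described in step 6. It only illustrates how a skipped update leaves the earliest next_run_time unchanged; how the real scheduler reacts is not modeled.

```python
import sqlite3


class ToyJobStore:
    """Hypothetical stand-in for a job store; in-memory SQLite replaces MySQL."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, next_run_time REAL)")
        self.online = True  # set to False to simulate the network outage

    def add_job(self, job_id, next_run_time):
        self.db.execute("INSERT INTO jobs VALUES (?, ?)", (job_id, next_run_time))

    def update_job(self, job_id, next_run_time):
        if not self.online:
            raise ConnectionError("simulated network failure")
        self.db.execute(
            "UPDATE jobs SET next_run_time = ? WHERE id = ?", (next_run_time, job_id)
        )

    def get_next_run_time(self):
        row = self.db.execute(
            "SELECT next_run_time FROM jobs WHERE next_run_time IS NOT NULL "
            "ORDER BY next_run_time LIMIT 1"
        ).fetchone()
        return row[0] if row else None


def run_once(store, job_id, interval, outage_during_job):
    """Run one job cycle, optionally dropping the 'network' while the job executes."""
    due = store.get_next_run_time()        # steps 1-2: pick the job and start it
    store.online = not outage_during_job   # step 3: outage begins mid-execution
    try:
        store.update_job(job_id, due + interval)  # step 5: persist new next_run_time
    except ConnectionError:
        pass                               # step 6: failure swallowed (assumption)
    store.online = True                    # step 8: network recovers
    return store.get_next_run_time()       # step 9: what the next query sees


store = ToyJobStore()
store.add_job("test_job", 100.0)
print(run_once(store, "test_job", 120.0, outage_during_job=True))   # timestamp unchanged
print(run_once(store, "test_job", 120.0, outage_during_job=False))  # timestamp advanced
```

With the outage enabled, the query in step 9 still sees the original timestamp; without it, the timestamp moves forward by one interval.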
Actual Logs (Production Evidence)
Oct 23, 2025 @ 11:07:00.056 DEBUG - Next wakeup is due at 2025-10-23 11:12:00+05:30 (in 299.943589 seconds)
Oct 23, 2025 @ 11:12:00.045 DEBUG - Next wakeup is due at 2025-10-23 11:20:00+05:30 (in 479.954906 seconds)  
Oct 23, 2025 @ 11:20:00.050 DEBUG - Next wakeup is due at 2025-10-23 11:22:00+05:30 (in 119.949250 seconds)
Oct 23, 2025 @ 11:22:00.025 DEBUG - Next wakeup is due at 2025-10-23 11:30:00+05:30 (in 479.974420 seconds)

[NO MORE LOGS AFTER THIS POINT - SCHEDULER PERMANENTLY FROZEN]

Database query showing the stale rows:
SELECT id, next_run_time, FROM_UNIXTIME(next_run_time) as readable_time 
FROM apscheduler_jobs_server1 ORDER BY next_run_time;

-- Results show stale timestamps:
-- id: 365, next_run_time: 1761199200, readable_time: 2025-10-23 11:30:00.000000 (in the past)
-- id: 821, next_run_time: 1761199320, readable_time: 2025-10-23 11:32:00.000000 (in the past)
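
The epoch values in these rows can be cross-checked outside MySQL. The sketch below assumes the readable_time column was rendered in IST (UTC+05:30), matching the offset in the log lines; `describe` and the inspection time `now` are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the MySQL session time zone is IST, as the +05:30 log offset suggests.
IST = timezone(timedelta(hours=5, minutes=30))


def describe(next_run_time, now):
    """Return the readable run time and whether it already lies in the past."""
    run_at = datetime.fromtimestamp(next_run_time, tz=IST)
    return run_at.isoformat(), run_at < now


# Hypothetical inspection time, some time after the last logged wakeup.
now = datetime(2025, 10, 23, 12, 0, tzinfo=IST)

for job_id, ts in [(365, 1761199200), (821, 1761199320)]:
    readable, in_past = describe(ts, now)
    print(job_id, readable, "in the past" if in_past else "in the future")
```

Job 365 resolves to 11:30:00+05:30, the same instant as the last "Next wakeup" entry in the log.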

Root Cause Analysis
Source Code Analysis
File: apscheduler/jobstores/sqlalchemy.py
Problem 1 - get_next_run_time() method (lines 88-97):


def get_next_run_time(self):
    selectable = (
        select(self.jobs_t.c.next_run_time)
        .where(self.jobs_t.c.next_run_time != null())
        .order_by(self.jobs_t.c.next_run_time)  # ← ALWAYS returns smallest timestamp
        .limit(1)
    )
    with self.engine.begin() as connection:
        next_run_time = connection.execute(selectable).scalar()
        return utc_timestamp_to_datetime(next_run_time)

Issue: No staleness detection - always returns job with smallest timestamp even if it's hours/days in the past.
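
As an illustration of what a staleness check could look like, here is a minimal sketch. `is_stale` and its `grace` parameter are hypothetical and not part of APScheduler; the one-minute default is an arbitrary choice. Such a helper could sit in an external health check that compares the value returned by get_next_run_time() with the current time.

```python
from datetime import datetime, timedelta, timezone


def is_stale(next_run_time, now=None, grace=timedelta(minutes=1)):
    """Return True if next_run_time lies further in the past than `grace`.

    Hypothetical helper: both datetimes are expected to be timezone-aware.
    A value of None (no scheduled jobs) is never considered stale.
    """
    if next_run_time is None:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return now - next_run_time > grace


# Example with fixed timestamps (UTC) so the result does not depend on the clock.
now = datetime(2025, 10, 23, 7, 0, tzinfo=timezone.utc)
print(is_stale(datetime(2025, 10, 23, 6, 0, tzinfo=timezone.utc), now))      # an hour overdue
print(is_stale(datetime(2025, 10, 23, 6, 59, 30, tzinfo=timezone.utc), now))  # within grace
```

A job that is a few seconds overdue is normal while it waits to be picked up, which is why the sketch uses a grace period rather than flagging every past timestamp.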

# Simplified pseudocode of the scheduler's job-processing flow (illustrative only)
def _process_jobs():
    # 1. Get next job to run
    next_run_time = jobstore.get_next_run_time()  # ← Returns earliest job

    # 2. Execute the job
    job = jobstore.get_due_jobs(now)[0]
    job.func(*job.args, **job.kwargs)  # ← JOB RUNS HERE (vulnerable window starts)

    # 3. Calculate next run time
    job.next_run_time = calculate_next_run_time(job)

    # 4. Update database with new next_run_time
    jobstore.update_job(job)  # ← MYSQL FAILURE HERE = DEADLOCK
                              # ← (vulnerable window ends)

Problem 2 - update_job() method (lines 113-125):

def update_job(self, job):
    # ... update query setup ...
    with self.engine.begin() as connection:
        result = connection.execute(update)  # ← Network failure here
        if result.rowcount == 0:
            raise JobLookupError(job.id)

Issue: MySQL network failures during connection.execute(update) are not handled. The database update fails silently, but the scheduler continues as if it had succeeded.
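
One possible mitigation is to retry the update with backoff instead of giving up on the first failure. The following is a minimal sketch, not a proposed patch: `call_with_retry` and all of its parameters are hypothetical, and the example uses the built-in ConnectionError so that it runs without a database.

```python
import time


def call_with_retry(operation, retry_on=(ConnectionError,), attempts=3,
                    base_delay=0.5, sleep=time.sleep):
    """Call `operation`, retrying with exponential backoff on the given errors.

    Hypothetical helper: re-raises the last error once `attempts` is exhausted,
    so a persistent outage is still visible to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...


# Example: an operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky_update():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "updated"


print(call_with_retry(flaky_update, sleep=lambda seconds: None))  # no real waiting here
```

With SQLAlchemy, lost connections typically surface as sqlalchemy.exc.OperationalError, so that class is a plausible candidate for `retry_on`; whether a retry at this level is the right fix for the job store is an open question.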

How can we reproduce the bug?

Minimal Reproduction

from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
import time

def long_running_job():
    print("Job started - this is the critical window!")
    time.sleep(30)  # 30-second window for MySQL failure
    print("Job completed")

# Setup
jobstore = SQLAlchemyJobStore(url='mysql://user:pass@host/db')
scheduler = BlockingScheduler(jobstores={'default': jobstore})
scheduler.add_job(long_running_job, 'interval', minutes=2, id='test_job')
scheduler.start()

# CRITICAL: MySQL must fail DURING job execution (within the 30-second window)
# Timing sequence:
# 1. Wait for "Job started" log message
# 2. IMMEDIATELY simulate MySQL failure:
#    - Block MySQL port: iptables -A OUTPUT -p tcp --dport 3306 -j DROP
#    - Or stop MySQL service: systemctl stop mysql
# 3. Wait for "Job completed" message (job finishes in memory)
# 4. Restore MySQL: iptables -D OUTPUT -p tcp --dport 3306 -j DROP
# 5. Observe: Scheduler logs final "Next wakeup" message then NEVER logs again
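
To see the "Next wakeup" messages referred to in step 5, debug logging has to be enabled before the scheduler starts. A minimal sketch using only the standard library; the format string is an arbitrary choice:

```python
import logging

# Send log records to stderr with timestamps.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s - %(message)s",
)

# APScheduler's loggers live under the "apscheduler" namespace;
# the wakeup messages are emitted at DEBUG level.
logging.getLogger("apscheduler").setLevel(logging.DEBUG)
```

Place these lines above the scheduler setup in the reproduction script.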

Why Timing Matters

Before Job Execution: Scheduler gets database error, handles it gracefully
After Job Execution: Database update succeeds, no stale data created
DURING Job Execution: Job completes but database update fails silently → DEADLOCK

Production Trigger Scenarios - TIMING CRITICAL
Scenario 1: MySQL Maintenance Window

Job starts execution at 11:20:00
DBA kills MySQL connections for maintenance at 11:20:15 (during job)
Job completes at 11:20:45, tries to update database
Update fails → Scheduler deadlocked

Scenario 2: Kubernetes Pod Restart

Long-running job (5+ minutes) starts execution
Kubernetes restarts MySQL pod during job execution
Job completes but cannot update next_run_time
Result: Permanent scheduler freeze

Scenario 3: Network Partition

Scheduled job starts processing large dataset
Network switch failure isolates scheduler from MySQL
Job finishes in memory but database update impossible
Scheduler never recovers even after network restoration

Scenario 4: Connection Pool Exhaustion

High-concurrency job starts execution
MySQL connection pool exhausted by other services
Job completes but scheduler cannot get DB connection for update
Database remains stale → Deadlock

Common Pattern: The vulnerable window is during job execution when:

Job is running (consuming CPU/memory)
Scheduler is waiting to update next_run_time
Any MySQL connectivity issue occurs

Non-vulnerable Windows:

MySQL down before job scheduled to run → Job never starts (graceful)
MySQL down after job completes and updates → No stale data (safe)

Expected vs Actual Behavior
Expected:

Scheduler detects database update failures
Retries updates when network recovers
Continues processing other jobs
Self-recovers without manual intervention

Actual:

Silent failure during database update
Scheduler queries return stale data indefinitely
Complete scheduler freeze
Manual database cleanup required for recovery

Business Impact

Zero Tolerance: All scheduled jobs stop working (notifications, reports, data processing)
Silent Failure: No error logs indicate the problem
Manual Recovery: Requires DBA intervention and application restart
Production Downtime: Critical scheduled operations halt completely
