danielaskdd
Contributor

Fix APScheduler Deadlocks with Dual Database Job Coordinators

Current Problem

The proxy server experienced periodic deadlocks during startup, caused by multiple APScheduler jobs attempting concurrent database access. The symptom was a stall inside the get_next_fire_time() function, which blocked the event loop and left the system unresponsive.

Root Cause

Multiple database jobs (update_spend, reset_budget, add_deployment, get_credentials, spend_log_cleanup, check_batch_cost) were scheduled independently, leading to:

  • Concurrent database access causing deadlocks
  • Event loop blocking during startup
  • System instability with multiple workers
  • APScheduler scheduler getting stuck during initialization

Solution

Implemented a dual-coordinator architecture that separates database jobs by execution frequency while maintaining sequential execution within each coordinator to prevent database conflicts.

Architecture Changes

1. DatabaseJobsCoordinator Class (Lines 3624-3665)

  • Tracks last execution time for each database job
  • Ensures jobs respect their configured intervals
  • Shared state tracker between both coordinators for centralized timing management
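A minimal sketch of what such a coordinator can look like (class and method names here are assumptions for illustration; the actual implementation lives in `proxy_server.py`):

```python
import time


class DatabaseJobsCoordinator:
    """Tracks the last run time of each database job so the coordinators
    can decide whether a job's configured interval has elapsed.
    (Sketch of the pattern described above, not the real class.)"""

    def __init__(self):
        # job name -> monotonic timestamp of the last execution
        self._last_run: dict = {}

    def should_run(self, job_name: str, interval_seconds: float) -> bool:
        """Return True if `interval_seconds` have passed since the job last ran."""
        now = time.monotonic()
        last = self._last_run.get(job_name)
        return last is None or (now - last) >= interval_seconds

    def mark_ran(self, job_name: str) -> None:
        self._last_run[job_name] = time.monotonic()
```

Sharing one such tracker between both coordinators gives a single source of truth for job timing, regardless of which coordinator fires a job.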

2. High-Frequency Coordinator (Lines 3668-3702)

Runs every 10 seconds and handles time-sensitive configuration updates:

  • add_deployment: Refresh model configuration from database
  • get_credentials: Refresh credentials from database

Rationale: These jobs need frequent execution to keep the proxy's model configuration in sync with database changes.
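The sequential-execution idea can be sketched as a coordinator that awaits each job in turn, isolating failures per job (function name and signature are assumptions; the real coordinator also consults the shared interval tracker):

```python
import asyncio


async def high_frequency_database_jobs_coordinator(jobs):
    """Run each database job sequentially so no two jobs touch the
    database at once; a failure in one job is logged and does not
    stop the rest. `jobs` maps job names to async callables."""
    results = {}
    for name, job in jobs.items():
        try:
            await job()
            results[name] = True   # job completed
        except Exception as exc:
            results[name] = False  # logged, but remaining jobs still run
            print(f"✗ {name} failed: {exc}")
    return results
```

Because jobs are awaited one at a time inside a single coroutine, the database never sees two coordinator-managed jobs at once, which is what removes the deadlock window.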

3. Low-Frequency Coordinator (Lines 3705-3844)

Runs every 60 seconds and handles maintenance tasks:

  • update_spend: Update spend logs (~60s with randomization to avoid worker collision)
  • reset_budget: Reset budget (3600-7200s with randomization)
  • spend_log_cleanup: Clean old logs (configurable interval)
  • check_batch_cost: Check batch costs (configurable interval)

Rationale: These tasks are less time-critical and can run at longer intervals without impacting system responsiveness.
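The interval randomization that keeps workers from colliding can be sketched like this (the helper name is an assumption; the 3600-7200s range for reset_budget comes from the description above):

```python
import random


def randomized_interval(base_seconds: int, jitter_seconds: int) -> int:
    """Pick a per-worker interval in [base, base + jitter] so multiple
    workers don't all hit the database at the same moment."""
    return base_seconds + random.randint(0, jitter_seconds)


# reset_budget: somewhere in 3600-7200s, different per worker
reset_budget_interval = randomized_interval(3600, 3600)
```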

4. Improved Scheduler Lifecycle Management (Lines 3947-4044)

  • Assign _scheduler_instance immediately after creation for proper shutdown handling
  • Removed duplicate assignment in _initialize_spend_tracking_background_jobs
  • Ensures scheduler can be properly shut down even if initialization fails

Key Benefits

  • Eliminates Deadlocks: Sequential execution within each coordinator prevents concurrent database access
  • Prevents Blocking: High-frequency tasks (10s) run independently from slower low-frequency tasks (60s+)
  • Fault Tolerance: One failed job doesn't stop the other jobs in its coordinator
  • Improved Reliability: Enhanced error handling with descriptive logging and fallback to defaults
  • Better Performance: Predictable execution patterns with separated high/low frequency paths
  • Proper Cleanup: Improved scheduler shutdown lifecycle prevents resource leaks

Technical Details

Error Handling Improvements

  • Added fallback to default values (86400s = 1 day) for invalid spend log cleanup intervals
  • Each job failure is logged but doesn't prevent other jobs from running
  • Enhanced logging with ✓/✗ symbols for easy debugging
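The fallback behavior can be sketched like this (the helper name is hypothetical; the actual parsing lives in `proxy_server.py`):

```python
DEFAULT_CLEANUP_INTERVAL_SECONDS = 86400  # 1 day


def resolve_cleanup_interval(raw) -> int:
    """Parse a configured spend-log cleanup interval, falling back to
    the default on missing, non-numeric, or non-positive values."""
    try:
        value = int(raw)
        if value <= 0:
            raise ValueError(raw)
        return value
    except (TypeError, ValueError):
        # Invalid value: log and fall back rather than crash the job
        print(f"✗ invalid cleanup interval {raw!r}; using default")
        return DEFAULT_CLEANUP_INTERVAL_SECONDS
```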

Execution Flow

Startup:
1. Create AsyncIOScheduler instance
2. Assign to global _scheduler_instance immediately
3. Load models/credentials if store_model_in_db=True
4. Schedule high-frequency coordinator (10s interval)
5. Schedule low-frequency coordinator (60s interval)
6. Schedule other jobs (alerting, reporting, etc.)
7. Start scheduler

Shutdown:
1. Stop scheduler (wait=False to avoid hanging)
2. Disconnect database
3. Close other connections

Backward Compatibility

  • Maintains all existing job functionality
  • Preserves randomization for budget reset and spend updates
  • All non-database jobs (alerting, reporting) continue to run independently
  • No configuration changes required
  • No database schema changes needed

Code Changes Summary

Modified Files

  • litellm/proxy/proxy_server.py

Lines Changed

  • Lines 3624-3665: Added DatabaseJobsCoordinator class
  • Lines 3668-3702: Implemented high_frequency_database_jobs_coordinator
  • Lines 3705-3844: Implemented low_frequency_database_jobs_coordinator
  • Lines 3947-4044: Refactored initialize_scheduled_background_jobs to use dual coordinators
  • Lines 3717-3732: Improved error handling for invalid configuration values
  • Lines 4049-4051: Improved scheduler lifecycle management

Why is this change important?

This change directly addresses a potential source of critical database errors (deadlocks), which can freeze proxy operations and require a manual restart. It hardens the proxy's stability, making it more reliable for production deployments.

• Track scheduler instance globally
• Shutdown scheduler before database
• Use wait=False for non-blocking shutdown
• Prevent concurrent DB access deadlocks
• Split 10s vs 60s+ task frequencies
• Add unified job state tracking
• Improve error isolation per job
• Move global assignment earlier
• Fix scheduler initialization order
• Improve lifecycle management

vercel bot commented Oct 3, 2025

@danielaskdd is attempting to deploy a commit to the CLERKIEAI Team on Vercel.

A member of the Team first needs to authorize it.

- Pre-calc random intervals once at init
- Increase the scheduling frequency of the low-frequency DB job for more precise job triggering
- Add comprehensive test coverage
- Remove unused mock_patch_aembedding function
- Replace decorator with inline AsyncMock
- Use kwargs for parameter verification
- Check mock was called before assertions
- Add proxy_logging_obj mocking to prevent interference between unit tests
- Add premium_user mock set to True to bypass enterprise validation
@danielaskdd
Contributor Author

Add new commit:

Optimize database job scheduling to eliminate "maximum instances reached" warnings

  • Separate high-frequency (10s) and low-frequency (30min) tasks
  • Configure misfire_grace_time: 5s for high-freq, 20min for low-freq
  • Set coalesce=False to skip missed runs instead of queuing
  • Eliminate "maximum instances reached" warnings

This reduces unnecessary scheduling overhead for long-running tasks
while maintaining proper execution timing for all database operations.
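The scheduling parameters above can be sketched as `add_job` keyword arguments; a recording stub stands in for APScheduler's AsyncIOScheduler so the snippet is self-contained, but the same kwargs apply to the real `add_job()`:

```python
class StubScheduler:
    """Records jobs instead of scheduling them (stand-in for
    AsyncIOScheduler in this sketch)."""

    def __init__(self):
        self.jobs = []

    def add_job(self, func, trigger, **kwargs):
        self.jobs.append({"func": func, "trigger": trigger, **kwargs})


def schedule_coordinators(scheduler, high_freq_job, low_freq_job):
    # High-frequency coordinator: 10s interval, tight 5s misfire grace window
    scheduler.add_job(high_freq_job, "interval", seconds=10,
                      misfire_grace_time=5, coalesce=False)
    # Low-frequency coordinator: 30min interval, generous 20min grace window
    scheduler.add_job(low_freq_job, "interval", seconds=1800,
                      misfire_grace_time=1200, coalesce=False)
```

Runs that miss their slot by more than `misfire_grace_time` are discarded rather than queued, which is what removes the "maximum instances reached" warnings for long-running tasks.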
