Fix APScheduler Deadlocks with Dual Database Job Coordinators #15162
Open
danielaskdd wants to merge 10 commits into BerriAI:main from danielaskdd:fix-sigint-exit
+1,031
−149
Conversation
• Track scheduler instance globally • Shutdown scheduler before database • Use wait=False for non-blocking shutdown
* Prevent concurrent DB access deadlocks * Split 10s vs 60s+ task frequencies * Add unified job state tracking * Improve error isolation per job
* Move global assignment earlier * Fix scheduler initialization order * Improve lifecycle management
* Add PLR0915 noqa comment
- Pre-calc random intervals once at init - Increased scheduler frequency of low frequency db job for more precise triggering of jobs - Add comprehensive test coverage
- Remove unused mock_patch_aembedding function - Replace decorator with inline AsyncMock - Use kwargs for parameter verification - Check mock was called before assertions
- Add proxy_logging_obj mocking to prevent interference between unit tests - Add premium_user mock set to True to bypass enterprise validation
- Optimize database job scheduling: separate high-frequency (10s) and low-frequency (30min) tasks - Configure misfire_grace_time: 5s for high-freq, 20min for low-freq - Set coalesce=False to skip missed runs instead of queuing - Eliminate "maximum instances reached" warnings. This reduces unnecessary scheduling overhead for long-running tasks while maintaining proper execution timing for all database operations.
Add new commit: Optimize database job scheduling to eliminate "maximum instances reached" warnings.
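For context, a hedged sketch of how the misfire_grace_time and coalesce values named in the commit above can be applied to APScheduler interval jobs; the coordinator callables and the demo runner are placeholders, not code from this PR:

```python
# Illustrative sketch only: applying the commit's misfire_grace_time / coalesce
# values to APScheduler interval jobs. Not the PR's actual code.
import asyncio
from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def high_frequency_database_jobs_coordinator():
    ...  # placeholder for the 10-second coordinator


async def low_frequency_database_jobs_coordinator():
    ...  # placeholder for the slower maintenance coordinator


async def main():
    scheduler = AsyncIOScheduler()
    scheduler.add_job(
        high_frequency_database_jobs_coordinator,
        "interval",
        seconds=10,
        misfire_grace_time=5,   # a run starting more than 5s late is skipped as a misfire
        coalesce=False,         # setting taken from the commit message above
        max_instances=1,        # no overlapping runs, so no "maximum instances reached"
    )
    scheduler.add_job(
        low_frequency_database_jobs_coordinator,
        "interval",
        minutes=30,
        misfire_grace_time=20 * 60,  # 20 minutes of grace for slow maintenance work
        coalesce=False,
        max_instances=1,
    )
    scheduler.start()            # AsyncIOScheduler needs a running event loop
    await asyncio.sleep(60)      # keep the loop alive long enough to observe ticks
    scheduler.shutdown(wait=False)


if __name__ == "__main__":
    asyncio.run(main())
```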
Fix APScheduler Deadlocks with Dual Database Job Coordinators
Current Problem
The proxy server experienced periodic deadlocks during startup, primarily caused by concurrent database access attempts from multiple APScheduler jobs. This led to system unresponsiveness, specifically a stall inside the get_next_fire_time() function that blocked all event loops.
Root Cause
Multiple database jobs (update_spend, reset_budget, add_deployment, get_credentials, spend_log_cleanup, check_batch_cost) were scheduled independently, leading to overlapping database access, lock contention, and the deadlocks described above.
Solution
Implemented a dual-coordinator architecture that separates database jobs by execution frequency while maintaining sequential execution within each coordinator to prevent database conflicts.
Architecture Changes
1. DatabaseJobsCoordinator Class (Lines 3624-3665)
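The class body is not reproduced in this description; below is a minimal sketch of a sequential, error-isolated coordinator along these lines, as an illustration of the described design rather than the PR's exact code:

```python
# Minimal sketch of a sequential, error-isolated job coordinator in the spirit
# of the DatabaseJobsCoordinator described here; not the PR's exact code.
import logging
from typing import Awaitable, Callable, List, Tuple

logger = logging.getLogger(__name__)


class DatabaseJobsCoordinator:
    """Runs registered async jobs one at a time, isolating failures per job."""

    def __init__(self, name: str):
        self.name = name
        self.jobs: List[Tuple[str, Callable[[], Awaitable[None]]]] = []

    def add_job(self, job_name: str, job: Callable[[], Awaitable[None]]) -> None:
        self.jobs.append((job_name, job))

    async def run_jobs(self) -> None:
        # Sequential execution: no two jobs in this coordinator can touch the
        # database at the same time.
        for job_name, job in self.jobs:
            try:
                await job()
            except Exception:
                # Error isolation: a failing job is logged and the loop moves on.
                logger.exception("%s: job %s failed", self.name, job_name)
```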
2. High-Frequency Coordinator (Lines 3668-3702)
Runs every 10 seconds and handles time-sensitive configuration updates:
add_deployment: Refresh model configuration from database
get_credentials: Refresh credentials from database
Rationale: These jobs need frequent execution to keep the proxy's model configuration in sync with database changes.
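As an illustration of the pattern (not litellm's actual code), a 10-second coordinator could look roughly like this, with placeholder refresh functions standing in for the add_deployment and get_credentials jobs:

```python
# Sketch only: a 10-second coordinator that refreshes deployments and
# credentials sequentially. The refresh functions are placeholders.
import logging
from apscheduler.schedulers.asyncio import AsyncIOScheduler

logger = logging.getLogger(__name__)


async def refresh_deployments():
    ...  # stand-in for add_deployment: reload model config from the database


async def refresh_credentials():
    ...  # stand-in for get_credentials: reload credentials from the database


async def high_frequency_database_jobs_coordinator():
    # Sequential execution with per-job error isolation: the credentials
    # refresh still runs even if the deployment refresh fails.
    for name, job in (("add_deployment", refresh_deployments),
                      ("get_credentials", refresh_credentials)):
        try:
            await job()
        except Exception:
            logger.exception("high-frequency job %s failed", name)


def register_high_frequency_job(scheduler: AsyncIOScheduler) -> None:
    # One APScheduler job for the whole coordinator keeps DB access serialized.
    scheduler.add_job(
        high_frequency_database_jobs_coordinator,
        "interval",
        seconds=10,
        max_instances=1,
    )
```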
3. Low-Frequency Coordinator (Lines 3705-3844)
Runs every 60 seconds and handles maintenance tasks:
update_spend: Update spend logs (~60s with randomization to avoid worker collision)
reset_budget: Reset budget (3600-7200s with randomization)
spend_log_cleanup: Clean old logs (configurable interval)
check_batch_cost: Check batch costs (configurable interval)
Rationale: These tasks are less time-critical and can run at longer intervals without impacting system responsiveness.
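A rough sketch of how a 60-second coordinator tick can drive jobs with longer, pre-randomized intervals, per the "pre-calc random intervals once at init" commit; the job bodies, the ~60s randomization window, and the two interval constants are placeholders, not litellm's real values:

```python
# Sketch only: a 60-second tick that runs each maintenance job only when its
# own pre-calculated (optionally randomized) interval has elapsed.
import logging
import random
import time

logger = logging.getLogger(__name__)

# Placeholder job bodies and configurable intervals (illustrative values).
async def update_spend(): ...
async def reset_budget(): ...
async def spend_log_cleanup(): ...
async def check_batch_cost(): ...

SPEND_LOG_CLEANUP_INTERVAL = 3600   # placeholder for the configurable value
CHECK_BATCH_COST_INTERVAL = 3600    # placeholder for the configurable value

# Intervals are calculated once at init; randomization avoids worker collision.
LOW_FREQUENCY_JOBS = [
    ("update_spend", update_spend, random.uniform(55, 65)),
    ("reset_budget", reset_budget, random.uniform(3600, 7200)),
    ("spend_log_cleanup", spend_log_cleanup, SPEND_LOG_CLEANUP_INTERVAL),
    ("check_batch_cost", check_batch_cost, CHECK_BATCH_COST_INTERVAL),
]
_last_run = {name: 0.0 for name, _, _ in LOW_FREQUENCY_JOBS}


async def low_frequency_database_jobs_coordinator():
    """Called every 60 seconds; runs only the jobs whose interval has elapsed."""
    now = time.monotonic()
    for name, job, interval in LOW_FREQUENCY_JOBS:
        if now - _last_run[name] < interval:
            continue
        _last_run[name] = now   # record the attempt regardless of outcome
        try:
            await job()         # sequential: one database job at a time
        except Exception:
            logger.exception("low-frequency job %s failed", name)
```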
4. Improved Scheduler Lifecycle Management (Lines 3947-4044)
Track _scheduler_instance immediately after creation for proper shutdown handling
Related changes in _initialize_spend_tracking_background_jobs
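A hedged sketch of the shutdown ordering described in the commits (track the scheduler globally, stop it with wait=False before touching the database); the db_client handle and the on_shutdown hook name are illustrative, not litellm's actual API:

```python
# Sketch only: scheduler lifecycle ordering (track globally, shut down the
# scheduler before the database, use a non-blocking shutdown).
from apscheduler.schedulers.asyncio import AsyncIOScheduler

_scheduler_instance = None  # tracked globally so shutdown code can reach it


def initialize_scheduled_background_jobs():
    global _scheduler_instance
    scheduler = AsyncIOScheduler()
    # Assign the global immediately after creation, so an early SIGINT can
    # still find and stop the scheduler even if job registration is unfinished.
    _scheduler_instance = scheduler
    # ... register the high- and low-frequency coordinator jobs here ...
    scheduler.start()  # assumes a running event loop (startup-hook context)


async def on_shutdown(db_client):
    # Stop the scheduler first so no coordinator job fires against a database
    # connection that is being torn down.
    if _scheduler_instance is not None:
        _scheduler_instance.shutdown(wait=False)  # non-blocking shutdown
    # Only afterwards close the database connection (illustrative call).
    await db_client.disconnect()
```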
Key Benefits
✅ Eliminates Deadlocks: Sequential execution within each coordinator prevents concurrent database access
✅ Prevents Blocking: High-frequency tasks (10s) run independently from slow low-frequency tasks (60s+)
✅ Fault Tolerance: One failed job doesn't stop other jobs in the coordinator
✅ Improved Reliability: Enhanced error handling with descriptive logging and fallback to defaults
✅ Better Performance: Predictable execution patterns with separated high/low frequency paths
✅ Proper Cleanup: Improved scheduler shutdown lifecycle prevents resource leaks
Technical Details
Error Handling Improvements
Execution Flow
Backward Compatibility
Code Changes Summary
Modified Files
litellm/proxy/proxy_server.py
Lines Changed
DatabaseJobsCoordinator class
high_frequency_database_jobs_coordinator
low_frequency_database_jobs_coordinator
initialize_scheduled_background_jobs to use dual coordinators
Why is this change important?
This change directly addresses a potential source of critical database errors (deadlocks), which can freeze proxy operations and require a manual restart. It hardens the proxy's stability, making it more reliable for production deployments.