Add Celery worker memory and task limits for leak prevention #1051
Merged
5 commits:
- 712f579 Add Celery worker memory and task limits for leak prevention (mihow)
- 52ad3bc fix: celery task time limit was in hours not days (mihow)
- 515ce53 feat: refine limits (mihow)
- db3e962 fix: remove setting that isn't applicable for RabbitMQ (mihow)
- 7ac1de1 chore: typo in docs (mihow)
# Celery Worker Monitoring Guide

This guide explains how to monitor Celery worker health and detect issues like memory leaks or hung workers.

## Worker Protections In Place

The workers are configured with automatic protections to prevent long-running issues:

- **`--max-tasks-per-child=100`** - Each worker process restarts after 100 tasks
  - Prevents memory leaks from accumulating over time
  - A clean slate every 100 tasks ensures consistent performance

- **`--max-memory-per-child=2097152`** - Each worker process restarts if it exceeds 2 GiB
  - Measured in KB (2097152 KB = 2 GiB)
  - With the prefork pool (the default), this limit is **per worker process**
  - Example: 8 CPUs = 8 worker processes × 2 GiB = 16 GiB max total

These protections cause workers to restart gracefully before problems occur; a representative start command is sketched below.
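For reference, a minimal sketch of a worker start command with both flags applied (the app module `config.celery_app` and queue name `antenna` are taken from the scaling examples later in this guide; exact values may differ in your compose files):

```bash
# Start a worker with both leak protections enabled.
# --max-memory-per-child is given in KB: 2 GiB = 2 * 1024 * 1024 = 2097152 KB.
celery -A config.celery_app worker \
    --queues=antenna \
    --max-tasks-per-child=100 \
    --max-memory-per-child=2097152
```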
## Monitoring Tools

### 1. Flower (http://localhost:5555)
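If Flower is not already running as a service, it can usually be launched as a Celery subcommand (a sketch, assuming the `flower` package is installed and the same `config.celery_app` module used by the workers):

```bash
# Serve the Flower dashboard on port 5555
celery -A config.celery_app flower --port=5555
```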
**Active Tasks View:**
- Shows currently running tasks
- Look for tasks stuck in the "running" state for unusually long periods
- `CELERY_TASK_TIME_LIMIT` is 4 days, so anything approaching that is suspicious

**Worker Page:**
- Shows the last heartbeat timestamp for each worker
- A stale heartbeat (> 60 seconds) indicates a potential issue
- Shows worker status, active tasks, and resource usage

**Task Timeline:**
- Sort tasks by duration to find long-running ones
- Filter by state (SUCCESS, FAILURE, RETRY, etc.)
- See task arguments and return values

**Limitations:**
- Cannot detect deadlocked workers that stop sending heartbeats (see the ping sketch below)
- Shows task status but not detailed memory usage
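A liveness probe that does not rely on Flower's heartbeat view is Celery's built-in remote-control ping (a sketch; deadlocked workers simply fail to reply):

```bash
# Every responsive worker answers with pong; missing replies indicate hung workers
celery -A config.celery_app inspect ping
```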
### 2. New Relic APM

**Transaction Traces:**
- Automatic instrumentation via the `newrelic-admin` wrapper
- See detailed task execution times and database queries
- Identify slow tasks that might cause memory buildup

**Infrastructure Monitoring:**
- Monitor container CPU and memory usage
- Set alerts for high memory usage approaching the limits
- Track worker process count and restarts

**Custom Instrumentation:**
- Add custom metrics for task-specific monitoring
- Track business metrics (images processed, detections created, etc.)

### 3. Docker Monitoring

**Check worker health:**
```bash
# View running containers and resource usage
docker stats

# Check worker logs for restart messages
docker compose logs celeryworker | grep -i "restart\|warm shutdown\|cold shutdown"

# See recent worker activity
docker compose logs --tail=100 celeryworker
```

**Worker restart patterns:**
When a worker hits a limit, you'll see:
```
[INFO/MainProcess] Warm shutdown (MainProcess)
[INFO/MainProcess] Celery worker: Restarting pool processes
```
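To gauge restart frequency against an alert threshold (see the New Relic alerts below), a rough count over a recent log window can help; the matched string assumes the log format shown above:

```bash
# Count warm shutdowns in the last hour of worker logs
docker compose logs --since=1h celeryworker | grep -ci "warm shutdown"
```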
## Detecting Hung Workers

### Signs of a hung worker:

1. **Tasks stuck in the "started" state** (Flower active tasks page)
   - Task shows as running but makes no progress
   - Duration keeps increasing without completion

2. **Stale worker heartbeats** (Flower workers page)
   - Last heartbeat > 60 seconds ago
   - Worker shows as offline or unresponsive

3. **Queue backlog building up** (Flower or RabbitMQ management)
   - Tasks accumulating in the queue but not being processed
   - Check: http://localhost:15672 (RabbitMQ management UI)
   - Default credentials: `rabbituser` / `rabbitpass`

4. **Container health** (Docker stats)
   - CPU stuck at 100% with no task progress
   - Memory at the ceiling without task completion
### Manual investigation:

```bash
# Attach to the worker container
docker compose exec celeryworker bash

# Check running Python processes
ps aux | grep celery

# Check memory usage of worker processes
ps aux --sort=-%mem | grep celery | head -10

# See active connections to RabbitMQ
netstat -an | grep 5672
```
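Celery's remote-control commands give a complementary view without attaching to the container (a sketch, using the same `config.celery_app` module as above):

```bash
# List the tasks each worker is currently executing
celery -A config.celery_app inspect active

# Per-worker counters, pool configuration, and rusage statistics
celery -A config.celery_app inspect stats
```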
## Recommended Monitoring Setup

### Flower Alerts (Manual Checks)

Visit Flower periodically and check:
1. Workers page - all workers showing recent heartbeats?
2. Tasks page - any tasks running > 1 hour?
3. Queue length - is it growing unexpectedly?
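These checks can also be scripted against Flower's HTTP API (a sketch; `/api/workers` is part of Flower's documented REST API):

```bash
# Worker status and heartbeats as JSON, suitable for a cron-based check
curl -s http://localhost:5555/api/workers
```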
### New Relic Alerts (Automated)

Set up alerts for:
- Container memory > 80% for > 5 minutes
- Task duration > 1 hour (customize based on your use case)
- Worker process restarts > 10 per hour
- Queue depth > 100 tasks

### RabbitMQ Management UI

Access at: http://localhost:15672
- Monitor queue lengths
- Check consumer counts (should match the number of worker processes)
- View connection status and channels
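The same figures are available from the command line via `rabbitmqctl` (a sketch; the compose service name `rabbitmq` is an assumption):

```bash
# Queue depth and consumer count per queue
docker compose exec rabbitmq rabbitmqctl list_queues name messages consumers
```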
## Troubleshooting Common Issues

### Worker keeps restarting every 100 tasks

**This is normal behavior** with `--max-tasks-per-child=100`.

If you see too many restarts:
- Check task complexity - are tasks heavier than expected?
- Consider increasing the limit: `--max-tasks-per-child=200`
- Monitor whether specific task types cause issues

### Worker hitting memory limit frequently

If workers constantly hit the 2 GiB limit:
- Review which tasks use the most memory
- Optimize data loading (use streaming for large datasets)
- Consider increasing the limit: `--max-memory-per-child=3145728` (3 GiB)
- Check for memory leaks in task code
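The flag takes a value in KB, so it is worth sanity-checking the conversion when changing it:

```bash
# KB value for an N GiB cap is N * 1024 * 1024
echo $((3 * 1024 * 1024))   # 3145728 KB = 3 GiB
```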
### Tasks timing out

Current limits:
- `CELERY_TASK_TIME_LIMIT = 4 * 60 * 60 * 24` (4 days, in seconds)
- `CELERY_TASK_SOFT_TIME_LIMIT = 3 * 60 * 60 * 24` (3 days, in seconds)

These are very generous. If tasks time out:
- Check task logs for errors
- Review the volume of data being processed
- Consider breaking large jobs into smaller tasks

### Queue backlog growing

Possible causes:
1. Workers offline - check `docker compose ps`
2. Tasks failing and retrying - check Flower for failures
3. Task creation rate > processing rate - scale up workers
4. Tasks hung - restart workers: `docker compose restart celeryworker`
## Scaling Workers

### Horizontal scaling (more containers):

```bash
# Local development
docker compose up -d --scale celeryworker=3

# Production (docker-compose.worker.yml)
docker compose -f docker-compose.worker.yml up -d --scale celeryworker=3
```

### Vertical scaling (more processes per container):

Add a concurrency flag to the worker start command:
```bash
celery -A config.celery_app worker --queues=antenna --concurrency=16
```

The default is the number of CPUs. Increase it if CPU is underutilized.
## Memory Tuning Guidelines

Based on your ML workload:

**Current setting:** 2 GiB per worker process (production), 1 GiB (local dev)
**Task requirement:** JSON orchestration with large request/response payloads

This provides adequate headroom for JSON processing and HTTP overhead.

**If tasks consistently use more:**
- Increase: `--max-memory-per-child=3145728` (3 GiB)
- Ensure the container/host has sufficient memory
- Monitor total memory: (num CPUs) × (limit per child), as estimated below
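A quick way to estimate that worst case on a given host (the 2 GiB per-child cap is the current production setting; the pool size defaults to the CPU count):

```bash
# Worst-case pool memory = worker processes x per-child cap (2 GiB each assumed)
echo "$(( $(nproc) * 2 )) GiB worst case"
```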
**If tasks use much less:**
- Decrease: `--max-memory-per-child=1048576` (1 GiB)
- Allows more workers on the same hardware
- Faster to detect actual memory leaks

## Best Practices

1. **Monitor regularly** - Check the Flower dashboard daily during active development
2. **Log memory-intensive tasks** - Add logging for tasks processing large datasets
3. **Test limits locally** - Verify worker restarts work correctly in development
4. **Set New Relic alerts** - Automate detection of worker issues
5. **Review worker logs** - Look for patterns in restarts and failures
6. **Tune limits based on reality** - Adjust after observing actual task behavior