Add Celery worker memory and task limits for leak prevention #1051
Conversation
Implements worker-level protections to prevent memory leaks and resource exhaustion in long-running Celery workers processing ML tasks.

Changes:
- Add `--max-tasks-per-child=100` to restart worker processes after 100 tasks
- Add `--max-memory-per-child=2000000` (2GB) to restart on memory threshold
- Update RabbitMQ broker connection settings with heartbeat and timeouts
- Add WORKER_MONITORING.md guide for detecting hung workers via Flower/New Relic

Technical details:
- Limits apply per worker process (prefork pool creates one per CPU)
- 2GB chosen to handle 1GB max task size with 2x overhead
- RabbitMQ heartbeat set to 60s to detect broken connections
- Workers gracefully restart when limits hit (no task loss)

Monitoring:
- Use Flower to detect tasks stuck in running state
- Use New Relic APM for memory usage alerts
- Check Docker stats for worker container health
- See WORKER_MONITORING.md for detailed guide

Co-Authored-By: Claude <[email protected]>
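The sizing arithmetic behind these limits can be checked directly. Note that Celery interprets `--max-memory-per-child` in kilobytes, and 2,000,000 KB is slightly under 2 GiB (a quick sketch, not project code):

```python
# Sanity-check the worker memory limit arithmetic used in this PR.
# --max-memory-per-child is interpreted by Celery in kilobytes.
max_memory_kb = 2_000_000          # value passed on the command line
kib_per_gib = 1024 * 1024          # KiB in one GiB

limit_gib = max_memory_kb / kib_per_gib
print(f"per-child limit: {limit_gib:.2f} GiB")   # ~1.91 GiB, just under 2 GiB

# With the prefork pool, one child per CPU: worst-case total on an 8-vCPU host.
cpus = 8
total_gib = cpus * limit_gib
print(f"worst case across {cpus} children: {total_gib:.1f} GiB")
```

So the "8 CPUs × 2GB = 16GB" figure quoted in the description is a slight over-estimate of the actual worst case.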
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds per-child memory and task limits to local and production Celery worker start scripts, tightens Celery task time limits, adds RabbitMQ heartbeat and broker-retry settings in base settings, and introduces a new worker monitoring guide.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Supervisor
    participant StartScript as "start (shell)"
    participant Celery as "celery worker"
    Note over StartScript: export MAX_TASKS_PER_CHILD=100\nexport MAX_MEMORY_PER_CHILD=1048576
    Supervisor->>StartScript: execute start script
    StartScript->>Celery: exec celery worker --max-tasks-per-child=100 --max-memory-per-child=1048576 ...
    alt Task execution
        Celery->>Celery: run tasks
        note right of Celery: children monitored for\ntask count and resident memory
    else Limit reached (task or memory)
        Celery-->>StartScript: child exits due to limit
        StartScript->>Celery: celery spawns new child (managed by celery)
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ Passed (3 checks)
✅ Deploy Preview for antenna-preview canceled.
Pull Request Overview
This PR implements worker-level protections to prevent memory leaks and resource exhaustion in Celery workers processing ML tasks. It adds task and memory limits per worker child process, updates RabbitMQ broker connection settings for improved reliability, and provides comprehensive monitoring documentation.
Key changes:
- Worker processes now restart after 100 tasks or when exceeding 2GB memory to prevent leak accumulation
- RabbitMQ connection settings enhanced with 60-second heartbeat interval and improved timeout configurations
- Comprehensive monitoring guide added covering Flower, New Relic APM, and Docker-based health checks
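As a rough sketch of the configuration these bullets describe (values taken from the review comments, not the repository's exact file contents; the setting names follow Celery's standard Django-style configuration keys):

```python
# Sketch of the Celery/RabbitMQ settings this PR describes (config/settings/base.py).
CELERY_TASK_TIME_LIMIT = 4 * 60 * 60 * 24       # hard limit: 4 days, in seconds
CELERY_TASK_SOFT_TIME_LIMIT = 3 * 60 * 60 * 24  # soft limit: 3 days

CELERY_BROKER_TRANSPORT_OPTIONS = {
    "heartbeat": 60,               # detect broken connections within ~60s
    "socket_timeout": 120,
    "socket_connect_timeout": 30,
    "socket_keepalive": True,
    "retry_on_timeout": True,
    "max_connections": 20,
}

# Keep retrying the broker connection instead of crashing the worker.
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True
CELERY_BROKER_CONNECTION_MAX_RETRIES = None     # None = retry forever
```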
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docs/WORKER_MONITORING.md | New comprehensive guide for monitoring Celery worker health, detecting hung workers, and troubleshooting common issues |
| config/settings/base.py | Updated broker transport options with RabbitMQ-specific settings including heartbeat interval and clearer documentation |
| compose/production/django/celery/worker/start | Added --max-tasks-per-child=100 and --max-memory-per-child=2000000 flags to production worker startup |
| compose/local/django/celery/worker/start | Added same memory/task limits to local development worker with debugger support |
Actionable comments posted: 2
🧹 Nitpick comments (7)
compose/local/django/celery/worker/start (1)
11-16: Clarify the memory limit value or adjust to exactly 2GB.

The comment states "2GB (in KB)" but 2000000 KB ≈ 1.91 GB. For exactly 2GB, use 2097152 KB (2 × 1024 × 1024). If 1.91 GB is intentional, consider clarifying in the comment.

Apply this diff if exactly 2GB is desired:

```diff
-# --max-memory-per-child=2000000 - Restart if worker process exceeds 2GB (in KB)
+# --max-memory-per-child=2097152 - Restart if worker process exceeds 2GB (in KB)
 #
 # With prefork pool (default), each CPU gets a worker process.
 # Example: 8 CPUs × 2GB = 16GB max total worker memory
```

And update lines 21 and 23:

```diff
-  exec python -Xfrozen_modules=off -m debugpy --listen 0.0.0.0:5679 -m celery -A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2000000
+  exec python -Xfrozen_modules=off -m debugpy --listen 0.0.0.0:5679 -m celery -A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2097152
```

```diff
-  exec watchfiles --filter python celery.__main__.main --args '-A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2000000'
+  exec watchfiles --filter python celery.__main__.main --args '-A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2097152'
```

docs/WORKER_MONITORING.md (5)
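The unit conversion this nitpick is about is easy to verify: 2,097,152 KiB is exactly 2 GiB, while 2,000,000 KB falls roughly 95 MiB short (a quick check, not project code):

```python
# Compare the two candidate values for --max-memory-per-child (Celery reads KB).
approx_2gb_kb = 2_000_000
exact_2gib_kb = 2 * 1024 * 1024        # 2,097,152

kib_per_gib = 1024 * 1024
print(approx_2gb_kb / kib_per_gib)     # ≈ 1.907 GiB
print(exact_2gib_kb / kib_per_gib)     # 2.0 GiB exactly
print(exact_2gib_kb - approx_2gb_kb)   # 97,152 KiB shortfall (~95 MiB)
```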
13-16: Conversion accuracy: 2000000 KB ≈ 1.91 GB, not 2GB.

For consistency with the actual configuration, either update the value to 2097152 KB for exactly 2GB, or clarify that the limit is approximately 1.91 GB. This applies to both the documentation and the startup scripts.

22-22: Format URL as markdown link.

Per markdown best practices, format the bare URL as a proper link.

Apply this diff:

```diff
-### 1. Flower (http://localhost:5555)
+### 1. Flower
+
+Access Flower at: [http://localhost:5555](http://localhost:5555)
```
75-78: Add language specification to fenced code block.

The code block should specify a language for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 [INFO/MainProcess] Warm shutdown (MainProcess)
 [INFO/MainProcess] Celery worker: Restarting pool processes
 ```
````

94-94: Format URL as markdown link.

Per markdown best practices, format the bare URL as a proper link.

Apply this diff:

```diff
- - Check: http://localhost:15672 (RabbitMQ management UI)
+ - Check: [RabbitMQ management UI](http://localhost:15672)
```
136-136: Format URL as markdown link.

Per markdown best practices, format the bare URL as a proper link.

Apply this diff:

```diff
-Access at: http://localhost:15672
+Access at: [http://localhost:15672](http://localhost:15672)
```

compose/production/django/celery/worker/start (1)
7-16: Production script matches local with same KB conversion note.

The protections are correctly applied and consistent with the local development script. However, the same KB to GB conversion issue exists here: 2000000 KB ≈ 1.91 GB, not exactly 2GB. If exactly 2GB is desired, use 2097152 KB.

Apply this diff for exactly 2GB:

```diff
-# --max-memory-per-child=2000000 - Restart if worker process exceeds 2GB (in KB)
+# --max-memory-per-child=2097152 - Restart if worker process exceeds 2GB (in KB)
 #
 # With prefork pool (default), each CPU gets a worker process.
 # Example: 8 CPUs × 2GB = 16GB max total worker memory
-exec newrelic-admin run-program celery -A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2000000
+exec newrelic-admin run-program celery -A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2097152
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- compose/local/django/celery/worker/start (1 hunks)
- compose/production/django/celery/worker/start (1 hunks)
- config/settings/base.py (1 hunks)
- docs/WORKER_MONITORING.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/WORKER_MONITORING.md
22-22: Bare URL used
(MD034, no-bare-urls)
75-75: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
94-94: Bare URL used
(MD034, no-bare-urls)
136-136: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Redirect rules
- GitHub Check: Header rules
- GitHub Check: Pages changed
- GitHub Check: Agent
- GitHub Check: test
🔇 Additional comments (5)
docs/WORKER_MONITORING.md (1)
1-226: Excellent comprehensive monitoring documentation.

This documentation provides valuable guidance for monitoring and troubleshooting Celery workers. The structure is clear, examples are practical, and it covers the most important operational concerns.
config/settings/base.py (2)
349-359: Good broker reliability improvements.

The addition of `heartbeat: 60` and updated RabbitMQ-specific comments improve connection reliability. The transport options are well-documented and appropriate for long-running ML tasks.

361-365: Infinite retry is appropriate with proper monitoring in place.

Setting `CELERY_BROKER_CONNECTION_MAX_RETRIES = None` ensures workers don't crash on temporary broker issues. This is a good choice given the comprehensive monitoring documentation added in this PR. Ensure your alerting can detect workers stuck in connection retry loops.
7-16: Excellent consistency between local and production environments.

The worker protections are properly applied in production with the same values as local development, which ensures consistent behavior across environments.
compose/local/django/celery/worker/start (1)
21-23: Flags verified as valid Celery worker options.

Both `--max-tasks-per-child` and `--max-memory-per-child` are supported Celery worker flags with prefork pool support. The values in your code (100 tasks and ~1.9 GB memory limit) are reasonable and correctly formatted. Both code paths maintain consistent configuration.
carlosgjs
left a comment
Lots of great documentation. I suggest addressing the copilot comment about the task timeout.
Actionable comments posted: 0
♻️ Duplicate comments (1)
docs/WORKER_MONITORING.md (1)
162-165: Correct the documented time limit formulas

These formulas drop the seconds factor; as written they evaluate to ~1.6 hours instead of matching the 4-day/3-day limits configured in `config/settings/base.py`. Please update the expressions so the documentation reflects the actual settings.

```diff
 Current limits:
-- `CELERY_TASK_TIME_LIMIT = 4 * 60 * 24` (4 days)
-- `CELERY_TASK_SOFT_TIME_LIMIT = 3 * 60 * 24` (3 days)
+- `CELERY_TASK_TIME_LIMIT = 4 * 60 * 60 * 24` (4 days)
+- `CELERY_TASK_SOFT_TIME_LIMIT = 3 * 60 * 60 * 24` (3 days)
```
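The off-by-a-factor error behind this comment is easy to demonstrate with plain arithmetic:

```python
# The buggy formula omits one factor of 60 (seconds per minute).
buggy_limit = 4 * 60 * 24            # 5,760 seconds, i.e. only 1.6 hours
correct_limit = 4 * 60 * 60 * 24     # 345,600 seconds = 4 days

print(buggy_limit / 3600)            # hours under the buggy formula: 1.6
print(correct_limit / 86400)         # days under the correct formula: 4.0
```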
🧹 Nitpick comments (1)
config/settings/base.py (1)
352-359: Remove Redis-only `visibility_timeout` from RabbitMQ options

`visibility_timeout` is only honored by the Redis and SQS transports; RabbitMQ ignores it, so keeping it here adds confusion without providing the intended protection. Please drop this entry (and its comment) or relocate it to a Redis-specific configuration block.

```diff
-        "visibility_timeout": 43200,  # 12 hours - how long tasks stay reserved before returning to queue
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- compose/local/django/celery/worker/start (1 hunks)
- compose/production/django/celery/worker/start (1 hunks)
- config/settings/base.py (2 hunks)
- docs/WORKER_MONITORING.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- compose/local/django/celery/worker/start
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/WORKER_MONITORING.md
22-22: Bare URL used
(MD034, no-bare-urls)
75-75: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
94-94: Bare URL used
(MD034, no-bare-urls)
136-136: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
Actionable comments posted: 0
🧹 Nitpick comments (3)
docs/WORKER_MONITORING.md (3)
22-42: Good resolution of the time limit documentation.

The documentation now correctly states `CELERY_TASK_TIME_LIMIT` as 4 days, which matches the code in `config/settings/base.py` line 326 (`4 * 60 * 60 * 24` = 345,600 seconds). The previous critical issue with incorrect time calculations has been resolved.

Optional: Consider wrapping bare URLs in markdown syntax.

Line 22 has a bare URL. Per markdown best practices, wrap it in angle brackets or a link:

```diff
-### 1. Flower (http://localhost:5555)
+### 1. Flower (<http://localhost:5555>)
```
75-78: Specify language for fenced code block.

The code block showing worker restart patterns should specify a language identifier for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 [INFO/MainProcess] Warm shutdown (MainProcess)
 [INFO/MainProcess] Celery worker: Restarting pool processes
 ```
````

94-94: Optional: Wrap bare URLs in markdown syntax.

Lines 94 and 136 contain bare URLs. Consider wrapping them for better markdown compliance:

```diff
- - Check: http://localhost:15672 (RabbitMQ management UI)
+ - Check: <http://localhost:15672> (RabbitMQ management UI)
```

```diff
-Access at: http://localhost:15672
+Access at: <http://localhost:15672>
```

Also applies to: 136-136
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- config/settings/base.py (2 hunks)
- docs/WORKER_MONITORING.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/WORKER_MONITORING.md
22-22: Bare URL used
(MD034, no-bare-urls)
75-75: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
94-94: Bare URL used
(MD034, no-bare-urls)
136-136: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (5)
docs/WORKER_MONITORING.md (2)
162-165: Correct time limit calculations and documentation.

The time limit calculations are now accurate:

- 4 days = `4 * 60 * 60 * 24` = 345,600 seconds ✓
- 3 days = `3 * 60 * 60 * 24` = 259,200 seconds ✓

This correctly matches the configuration in `config/settings/base.py` lines 326 and 329.
13-13: No issues found, documentation is accurate.

The verification confirms all memory limit values are correctly documented:
- Production: 2097152 KB (exactly 2 GiB) matches line 13 and line 204
- Local dev: 1048576 KB (exactly 1 GiB) matches line 204
The actual environment variable definitions in the worker start scripts confirm the documentation is accurate and consistent. The concern about "2,000,000 KB ≈ 2 GB" from the PR objectives does not apply to the current configuration.
Likely an incorrect or invalid review comment.
config/settings/base.py (3)
326-329: Correct time limit configuration.

The time limits are now correctly calculated:

- Hard limit: `4 * 60 * 60 * 24` = 345,600 seconds (4 days) ✓
- Soft limit: `3 * 60 * 60 * 24` = 259,200 seconds (3 days) ✓

These generous limits are appropriate for long-running ML tasks and provide a 1-day buffer between soft and hard limits for graceful shutdown. The values correctly match the documentation in `docs/WORKER_MONITORING.md`.
349-358: Well-configured RabbitMQ connection settings for reliability.

The broker transport options are well-suited for preventing connection issues in long-running workers:

- `heartbeat: 60` enables detection of broken connections within 60 seconds (aligns with monitoring guidance in `docs/WORKER_MONITORING.md` line 31)
- `socket_timeout: 120` and `socket_connect_timeout: 30` provide reasonable timeouts
- `socket_keepalive: True` and `retry_on_timeout: True` improve connection reliability
- `max_connections: 20` is a sensible per-process limit

These settings directly address the PR's goal of preventing hanging workers and stuck jobs.
360-364: Appropriate broker retry configuration for resilience.Setting
CELERY_BROKER_CONNECTION_MAX_RETRIES = None(infinite retries) is a good practice for production workers, ensuring they remain operational during transient network issues or broker restarts rather than crashing permanently. Combined withCELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True, workers will reliably connect even if RabbitMQ is temporarily unavailable during deployment.
Summary
Simple updates to Celery & RabbitMQ settings in attempts to prevent hanging workers and stuck jobs.
Related Issues
#1041
Detailed Description
Implements worker-level protections to prevent memory leaks and resource exhaustion in long-running Celery workers processing ML tasks.
Changes:
Technical details:
Monitoring:
Deployment Notes
Run `docker compose build` on the worker servers & production app.

Screenshots
BEFORE CHANGES
32GB of RAM, 8 vCPUs
Memory goes up and down, no obvious leaks.
Worker server with no major jobs running

Servers with one processing job running:
Web app server (runs one celery worker as well)

Worker server

In the pre-test, 500 captures processed fine without errors & the job status was updated correctly. It made 3,300 classifications.
https://antenna.insectai.org/projects/18/jobs/1999
Summary by CodeRabbit
New Features
Documentation