
Conversation

@mihow
Collaborator

@mihow mihow commented Nov 14, 2025

Summary

Simple updates to Celery & RabbitMQ settings in an attempt to prevent hanging workers and stuck jobs.

Related Issues

#1041

Detailed Description

Implements worker-level protections to prevent memory leaks and resource exhaustion in long-running Celery workers processing ML tasks.

Changes:

  • Add --max-tasks-per-child=100 to restart worker processes after 100 tasks
  • Add --max-memory-per-child=2000000 (2GB) to restart on memory threshold (equivalent config keys are sketched after this list)
  • Update RabbitMQ broker connection settings with heartbeat and timeouts
  • Add WORKER_MONITORING.md guide for detecting hung workers via Flower/New Relic
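
As context (not part of this PR's diff, which passes CLI flags in the worker start scripts), the same limits can also be expressed as Celery configuration keys; a minimal sketch using the standard settings names:

```python
# Sketch only: the PR sets these via CLI flags in the start scripts.
# Celery exposes equivalent configuration keys, shown here for reference.
from celery import Celery

app = Celery("config")

app.conf.worker_max_tasks_per_child = 100         # recycle each child after 100 tasks
app.conf.worker_max_memory_per_child = 2_000_000  # resident-memory cap in KiB (~1.9 GiB)
```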

Technical details:

  • Limits apply per worker process (prefork pool creates one per CPU)
  • 2GB chosen to handle 1GB max task size with 2x overhead
  • RabbitMQ heartbeat set to 60s to detect broken connections
  • Workers gracefully restart when limits hit (no task loss)

Monitoring:

  • Use Flower to manually look for tasks stuck in a running state (a small inspection sketch follows this list)
  • Use New Relic APM for memory usage alerts
  • Check Docker stats for worker container health
  • See WORKER_MONITORING.md for detailed guide
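
Beyond eyeballing Flower, stuck tasks can also be flagged programmatically through Celery's inspection API. A minimal sketch, assuming the project's app object is importable as `config.celery_app.app` (the `-A config.celery_app` target from the start scripts) and using an arbitrary one-hour threshold:

```python
# Sketch: list active tasks that have been running longer than a threshold.
# The import path and the 1-hour cutoff are assumptions, not values from this PR.
import time

from config.celery_app import app

STUCK_AFTER_SECONDS = 60 * 60  # hypothetical threshold; tune per workload

active = app.control.inspect().active() or {}  # {worker_name: [task_info, ...]}
for worker, tasks in active.items():
    for task in tasks:
        runtime = time.time() - task["time_start"]
        if runtime > STUCK_AFTER_SECONDS:
            print(f"{worker}: {task['name']} ({task['id']}) running for {runtime:.0f}s")
```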

Deployment Notes

Run `docker compose build` on the worker servers and the production app server.

Screenshots

BEFORE CHANGES
32GB of RAM, 8 vCPUs

Memory goes up and down, no obvious leaks.

Worker server with no major jobs running
image

Servers with one processing job running:

Web app server (runs one celery worker as well)
image

Worker server
image

In the pre-test, 500 captures were processed without errors and the job status was updated correctly, producing 3,300 classifications.
https://antenna.insectai.org/projects/18/jobs/1999

image

Summary by CodeRabbit

  • New Features

    • Celery workers now include automatic memory and per-worker task limits to improve reliability.
    • Broker connection resilience enhanced (heartbeat added; startup and unlimited retry options).
    • Default task time limits shortened to reduce runaway work.
  • Documentation

    • Added a Worker Monitoring guide covering protections, monitoring tools, troubleshooting, scaling, and tuning.

Copilot AI review requested due to automatic review settings November 14, 2025 07:30
@coderabbitai
Contributor

coderabbitai bot commented Nov 14, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds per-child memory and task limits to local and production Celery worker start scripts, tightens Celery task time limits and adds RabbitMQ heartbeat and broker-retry settings in base settings, and introduces a new worker monitoring guide.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Local & Production worker startup scripts**<br>`compose/local/django/celery/worker/start`, `compose/production/django/celery/worker/start` | Introduced MAX_TASKS_PER_CHILD and MAX_MEMORY_PER_CHILD variables (values added), passed --max-tasks-per-child and --max-memory-per-child to the celery worker in both debug and non-debug flows, and added explanatory comments about memory protections. |
| **Celery configuration**<br>`config/settings/base.py` | Reduced Celery task time limits (CELERY_TASK_TIME_LIMIT → 4 days, CELERY_TASK_SOFT_TIME_LIMIT → 3 days); added "heartbeat": 60 to CELERY_BROKER_TRANSPORT_OPTIONS; added broker retry settings: CELERY_BROKER_CONNECTION_RETRY, CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP, and CELERY_BROKER_CONNECTION_MAX_RETRIES. |
| **Documentation**<br>`docs/WORKER_MONITORING.md` | New worker monitoring guide covering worker protections (max-tasks-per-child, max-memory-per-child), monitoring tools and setups, detecting hung workers, troubleshooting steps, and tuning/scaling recommendations. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Supervisor
  participant StartScript as "start (shell)"
  participant Celery as "celery worker"
  Note over StartScript: export MAX_TASKS_PER_CHILD=100<br/>export MAX_MEMORY_PER_CHILD=1048576
  Supervisor->>StartScript: execute start script
  StartScript->>Celery: exec celery worker --max-tasks-per-child=100 --max-memory-per-child=1048576 ...
  alt Task execution
    Celery->>Celery: run tasks
    note right of Celery: children monitored for<br/>task count and resident memory
  else Limit reached (task or memory)
    Celery-->>StartScript: child exits due to limit
    StartScript->>Celery: celery spawns new child (managed by celery)
  end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Verify units and intended value of MAX_MEMORY_PER_CHILD (value 1048576 vs. comment claiming 1 GiB).
  • Confirm Celery time-limit arithmetic and expected behaviour for long-running tasks.
  • Review broker heartbeat and retry settings for compatibility with the deployed RabbitMQ configuration.

Suggested reviewers

  • carlosgjs

Poem

🐇
I tweak some limits with gentle paws,
so workers sleep before the cause.
Heartbeats hum and monitors sing,
new guards to keep the celery spring.
Hop, hop — safe fields and brighter logs! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main objective: adding worker memory and task limits to prevent leaks, which is the primary change across multiple files. |
| Description check | ✅ Passed | The description includes all major template sections: Summary, Related Issues, Detailed Description with technical details, Deployment Notes, and Screenshots with testing evidence. Minor sections like How to Test and Checklist are missing, but the provided information is comprehensive. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@netlify

netlify bot commented Nov 14, 2025

Deploy Preview for antenna-preview canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 7ac1de1 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-preview/deploys/691944730f384f0008cf1f9e |

Contributor

Copilot AI left a comment


Pull Request Overview

This PR implements worker-level protections to prevent memory leaks and resource exhaustion in Celery workers processing ML tasks. It adds task and memory limits per worker child process, updates RabbitMQ broker connection settings for improved reliability, and provides comprehensive monitoring documentation.

Key changes:

  • Worker processes now restart after 100 tasks or when exceeding 2GB memory to prevent leak accumulation
  • RabbitMQ connection settings enhanced with 60-second heartbeat interval and improved timeout configurations
  • Comprehensive monitoring guide added covering Flower, New Relic APM, and Docker-based health checks

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| docs/WORKER_MONITORING.md | New comprehensive guide for monitoring Celery worker health, detecting hung workers, and troubleshooting common issues |
| config/settings/base.py | Updated broker transport options with RabbitMQ-specific settings including heartbeat interval and clearer documentation |
| compose/production/django/celery/worker/start | Added --max-tasks-per-child=100 and --max-memory-per-child=2000000 flags to production worker startup |
| compose/local/django/celery/worker/start | Added same memory/task limits to local development worker with debugger support |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (7)
compose/local/django/celery/worker/start (1)

11-16: Clarify the memory limit value or adjust to exactly 2GB.

The comment states "2GB (in KB)" but 2000000 KB ≈ 1.91 GB. For exactly 2GB, use 2097152 KB (2 × 1024 × 1024). If 1.91 GB is intentional, consider clarifying in the comment.

Apply this diff if exactly 2GB is desired:

```diff
-# --max-memory-per-child=2000000 - Restart if worker process exceeds 2GB (in KB)
+# --max-memory-per-child=2097152 - Restart if worker process exceeds 2GB (in KB)
 #
 # With prefork pool (default), each CPU gets a worker process.
-# Example: 8 CPUs × 2GB = 16GB max total worker memory
+# Example: 8 CPUs × 2GB = 16GB max total worker memory
```

And update lines 21 and 23:

```diff
-    exec python -Xfrozen_modules=off -m debugpy --listen 0.0.0.0:5679 -m celery -A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2000000
+    exec python -Xfrozen_modules=off -m debugpy --listen 0.0.0.0:5679 -m celery -A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2097152
-    exec watchfiles --filter python celery.__main__.main --args '-A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2000000'
+    exec watchfiles --filter python celery.__main__.main --args '-A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2097152'
```
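
For the unit arithmetic behind this suggestion (Celery reads `--max-memory-per-child` in KiB), a quick sanity check:

```python
# 2,000,000 KiB is just under 2 GiB; 2,097,152 KiB is exactly 2 GiB.
KIB = 1024
print(2_000_000 * KIB / 1024**3)  # ≈ 1.91 (GiB, the current value)
print(2 * 1024**3 // KIB)         # 2097152 (KiB in exactly 2 GiB)
```
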
docs/WORKER_MONITORING.md (5)

13-16: Conversion accuracy: 2000000 KB ≈ 1.91 GB, not 2GB.

For consistency with the actual configuration, either update the value to 2097152 KB for exactly 2GB, or clarify that the limit is approximately 1.91 GB. This applies to both the documentation and the startup scripts.


22-22: Format URL as markdown link.

Per markdown best practices, format the bare URL as a proper link.

Apply this diff:

```diff
-### 1. Flower (http://localhost:5555)
+### 1. Flower
+
+Access Flower at: [http://localhost:5555](http://localhost:5555)
```

75-78: Add language specification to fenced code block.

The code block should specify a language for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 [INFO/MainProcess] Warm shutdown (MainProcess)
 [INFO/MainProcess] Celery worker: Restarting pool processes
````
---

`94-94`: **Format URL as markdown link.**

Per markdown best practices, format the bare URL as a proper link.



Apply this diff:

```diff
-   - Check: http://localhost:15672 (RabbitMQ management UI)
+   - Check: [RabbitMQ management UI](http://localhost:15672)
```

136-136: Format URL as markdown link.

Per markdown best practices, format the bare URL as a proper link.

Apply this diff:

```diff
-Access at: http://localhost:15672
+Access at: [http://localhost:15672](http://localhost:15672)
```
compose/production/django/celery/worker/start (1)

7-16: Production script matches local with same KB conversion note.

The protections are correctly applied and consistent with the local development script. However, the same KB to GB conversion issue exists here: 2000000 KB ≈ 1.91 GB, not exactly 2GB. If exactly 2GB is desired, use 2097152 KB.

Apply this diff for exactly 2GB:

```diff
-# --max-memory-per-child=2000000 - Restart if worker process exceeds 2GB (in KB)
+# --max-memory-per-child=2097152 - Restart if worker process exceeds 2GB (in KB)
 #
 # With prefork pool (default), each CPU gets a worker process.
 # Example: 8 CPUs × 2GB = 16GB max total worker memory

-exec newrelic-admin run-program celery -A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2000000
+exec newrelic-admin run-program celery -A config.celery_app worker --queues=antenna -l INFO --max-tasks-per-child=100 --max-memory-per-child=2097152
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b461fb and 712f579.

📒 Files selected for processing (4)
  • compose/local/django/celery/worker/start (1 hunks)
  • compose/production/django/celery/worker/start (1 hunks)
  • config/settings/base.py (1 hunks)
  • docs/WORKER_MONITORING.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/WORKER_MONITORING.md

22-22: Bare URL used

(MD034, no-bare-urls)


75-75: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


94-94: Bare URL used

(MD034, no-bare-urls)


136-136: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Redirect rules
  • GitHub Check: Header rules
  • GitHub Check: Pages changed
  • GitHub Check: Agent
  • GitHub Check: test
🔇 Additional comments (5)
docs/WORKER_MONITORING.md (1)

1-226: Excellent comprehensive monitoring documentation.

This documentation provides valuable guidance for monitoring and troubleshooting Celery workers. The structure is clear, examples are practical, and it covers the most important operational concerns.

config/settings/base.py (2)

349-359: Good broker reliability improvements.

The addition of heartbeat: 60 and updated RabbitMQ-specific comments improve connection reliability. The transport options are well-documented and appropriate for long-running ML tasks.


361-365: Infinite retry is appropriate with proper monitoring in place.

Setting CELERY_BROKER_CONNECTION_MAX_RETRIES = None ensures workers don't crash on temporary broker issues. This is a good choice given the comprehensive monitoring documentation added in this PR. Ensure your alerting can detect workers stuck in connection retry loops.

compose/production/django/celery/worker/start (1)

7-16: Excellent consistency between local and production environments.

The worker protections are properly applied in production with the same values as local development, which ensures consistent behavior across environments.

compose/local/django/celery/worker/start (1)

21-23: Flags verified as valid Celery worker options.

Both --max-tasks-per-child and --max-memory-per-child are supported Celery worker flags with pool support for prefork. The values in your code (100 tasks and ~1.9 GB memory limit) are reasonable and correctly formatted. Both code paths maintain consistent configuration.

Collaborator

@carlosgjs carlosgjs left a comment


Lots of great documentation. I suggest addressing the copilot comment about the task timeout.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
docs/WORKER_MONITORING.md (1)

162-165: Correct the documented time limit formulas

These formulas drop the seconds factor; as written they evaluate to ~1.6 hours instead of matching the 4‑day/3‑day limits configured in config/settings/base.py. Please update the expressions so the documentation reflects the actual settings.

```diff
-Current limits:
-- `CELERY_TASK_TIME_LIMIT = 4 * 60 * 24` (4 days)
-- `CELERY_TASK_SOFT_TIME_LIMIT = 3 * 60 * 24` (3 days)
+Current limits:
+- `CELERY_TASK_TIME_LIMIT = 4 * 60 * 60 * 24` (4 days)
+- `CELERY_TASK_SOFT_TIME_LIMIT = 3 * 60 * 60 * 24` (3 days)
```
🧹 Nitpick comments (1)
config/settings/base.py (1)

352-359: Remove Redis-only visibility_timeout from RabbitMQ options

visibility_timeout is only honored by Redis and SQS transports; RabbitMQ ignores it, so keeping it here adds confusion without providing the intended protection. Please drop this entry (and its comment) or relocate it to a Redis-specific configuration block.

-    "visibility_timeout": 43200,  # 12 hours - how long tasks stay reserved before returning to queue

(airflow.apache.org)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 712f579 and 515ce53.

📒 Files selected for processing (4)
  • compose/local/django/celery/worker/start (1 hunks)
  • compose/production/django/celery/worker/start (1 hunks)
  • config/settings/base.py (2 hunks)
  • docs/WORKER_MONITORING.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • compose/local/django/celery/worker/start
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/WORKER_MONITORING.md

22-22: Bare URL used

(MD034, no-bare-urls)


75-75: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


94-94: Bare URL used

(MD034, no-bare-urls)


136-136: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
docs/WORKER_MONITORING.md (3)

22-42: Good resolution of the time limit documentation.

The documentation now correctly states the CELERY_TASK_TIME_LIMIT as 4 days, which matches the code in config/settings/base.py line 326 (4 * 60 * 60 * 24 = 345,600 seconds). The previous critical issue with incorrect time calculations has been resolved.

Optional: Consider wrapping bare URLs in markdown syntax.

Line 22 has a bare URL. Per markdown best practices, wrap it in angle brackets or a link:

```diff
-### 1. Flower (http://localhost:5555)
+### 1. Flower (<http://localhost:5555>)
```

75-78: Specify language for fenced code block.

The code block showing worker restart patterns should specify a language identifier for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 [INFO/MainProcess] Warm shutdown (MainProcess)
 [INFO/MainProcess] Celery worker: Restarting pool processes
````

---

`94-94`: **Optional: Wrap bare URLs in markdown syntax.**

Lines 94 and 136 contain bare URLs. Consider wrapping them for better markdown compliance:

```diff
-   - Check: http://localhost:15672 (RabbitMQ management UI)
+   - Check: <http://localhost:15672> (RabbitMQ management UI)
-Access at: http://localhost:15672
+Access at: <http://localhost:15672>
```

Also applies to: 136-136

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 515ce53 and 7ac1de1.

📒 Files selected for processing (2)
  • config/settings/base.py (2 hunks)
  • docs/WORKER_MONITORING.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/WORKER_MONITORING.md

22-22: Bare URL used

(MD034, no-bare-urls)


75-75: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


94-94: Bare URL used

(MD034, no-bare-urls)


136-136: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (5)
docs/WORKER_MONITORING.md (2)

162-165: Correct time limit calculations and documentation.

The time limit calculations are now accurate:

  • 4 days = 4 * 60 * 60 * 24 = 345,600 seconds ✓
  • 3 days = 3 * 60 * 60 * 24 = 259,200 seconds ✓

This correctly matches the configuration in config/settings/base.py lines 326 and 329.


13-13: No issues found—documentation is accurate.

The verification confirms all memory limit values are correctly documented:

  • Production: 2097152 KB (exactly 2 GiB) matches line 13 and line 204
  • Local dev: 1048576 KB (exactly 1 GiB) matches line 204

The actual environment variable definitions in the worker start scripts confirm the documentation is accurate and consistent. The concern about "2,000,000 KB ≈ 2 GB" from the PR objectives does not apply to the current configuration.

Likely an incorrect or invalid review comment.

config/settings/base.py (3)

326-329: Correct time limit configuration.

The time limits are now correctly calculated:

  • Hard limit: 4 * 60 * 60 * 24 = 345,600 seconds (4 days) ✓
  • Soft limit: 3 * 60 * 60 * 24 = 259,200 seconds (3 days) ✓

These generous limits are appropriate for long-running ML tasks and provide a 1-day buffer between soft and hard limits for graceful shutdown. The values correctly match the documentation in docs/WORKER_MONITORING.md.


349-358: Well-configured RabbitMQ connection settings for reliability.

The broker transport options are well-suited for preventing connection issues in long-running workers:

  • heartbeat: 60 enables detection of broken connections within 60 seconds (aligns with monitoring guidance in docs/WORKER_MONITORING.md line 31)
  • socket_timeout: 120 and socket_connect_timeout: 30 provide reasonable timeouts
  • socket_keepalive: True and retry_on_timeout: True improve connection reliability
  • max_connections: 20 is a sensible per-process limit

These settings directly address the PR's goal of preventing hanging workers and stuck jobs.


360-364: Appropriate broker retry configuration for resilience.

Setting CELERY_BROKER_CONNECTION_MAX_RETRIES = None (infinite retries) is a good practice for production workers, ensuring they remain operational during transient network issues or broker restarts rather than crashing permanently. Combined with CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True, workers will reliably connect even if RabbitMQ is temporarily unavailable during deployment.
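
Pulling the values quoted in these comments together, the resulting block in config/settings/base.py presumably looks roughly like the following; this is a sketch reconstructed from the review, not the verbatim file:

```python
# Sketch of the settings described above (values as quoted in the review).
CELERY_TASK_TIME_LIMIT = 4 * 60 * 60 * 24       # hard limit: 4 days
CELERY_TASK_SOFT_TIME_LIMIT = 3 * 60 * 60 * 24  # soft limit: 3 days

CELERY_BROKER_TRANSPORT_OPTIONS = {
    "heartbeat": 60,              # detect broken connections within ~60s
    "socket_timeout": 120,
    "socket_connect_timeout": 30,
    "socket_keepalive": True,
    "retry_on_timeout": True,
    "max_connections": 20,
}

CELERY_BROKER_CONNECTION_RETRY = True
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True
CELERY_BROKER_CONNECTION_MAX_RETRIES = None  # retry forever rather than crash
```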

@mihow mihow merged commit 680fca7 into main Nov 16, 2025
7 checks passed
@mihow mihow deleted the feat/worker-protections-simple branch November 16, 2025 03:31