5 changes: 5 additions & 0 deletions README.md
@@ -46,6 +46,11 @@ Antenna uses [Docker](https://docs.docker.com/get-docker/) & [Docker Compose](ht

# To stream the logs
docker compose logs -f django celeryworker ui-dev

# To stop the ui-dev container, you must specify the profile when running `down` or `stop`
docker compose --profile ui-dev down
# Or!
docker compose --profile "*" down
```
_**Note that this will create a `ui/node_modules` folder if one does not exist yet. This folder is created by the mounting of the `/ui` folder
for the `ui-dev` service, and is written by a `root` user.
67 changes: 67 additions & 0 deletions docs/WORKER_MONITORING.md
@@ -140,6 +140,73 @@ Access at: http://localhost:15672

## Troubleshooting Common Issues

### Workers appear connected but tasks don't execute

**Symptoms:**
- Worker logs show "Connected to amqp://..." and "celery@... ready"
- `celery inspect` times out: "No nodes replied within time constraint"
- Flower shows "no workers connected"
- Task publishing hangs indefinitely
- RabbitMQ UI shows connections in "blocked" state

**Possible cause: RabbitMQ Disk Space Alarm**

When RabbitMQ runs low on disk space, it triggers an alarm and **blocks ALL connections** from publishing or consuming. This alarm is not prominently displayed in standard monitoring.
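
For a quick pass/fail check before walking through the full diagnosis below, `rabbitmq-diagnostics` can report active alarms directly (a minimal sketch, assuming RabbitMQ 3.8+ where this subcommand is available):

```bash
# Exits with a non-zero status if a disk or memory alarm is currently in effect
rabbitmq-diagnostics check_local_alarms
```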

**Diagnosis:**

1. Check RabbitMQ Management UI (http://rabbitmq-server:15672) → Connections tab

Copilot AI commented on Nov 26, 2025:

Inconsistent RabbitMQ Management UI URL. Line 136 uses http://localhost:15672 while line 158 uses http://rabbitmq-server:15672.

For consistency and clarity, both should use the same URL. If the UI is accessed from the host machine (outside Docker), localhost is correct. If accessed from within Docker containers, rabbitmq-server would be correct.

Based on the context at line 136, which shows "Access at: http://localhost:15672", the URL at line 158 should likely be:

http://localhost:15672
Suggested change:
- 1. Check RabbitMQ Management UI (http://rabbitmq-server:15672) → Connections tab
+ 1. Check RabbitMQ Management UI (http://localhost:15672) → Connections tab

- Look for State = "blocked" or "blocking" (a CLI alternative via the management HTTP API is sketched after these diagnosis steps)

2. Check for active alarms on RabbitMQ server:
```bash
rabbitmqctl list_alarms
# Note: "rabbitmqctl status | grep alarms" is unreliable
```

3. Check disk space:
```bash
df -h
```

4. Check RabbitMQ logs:
```bash
journalctl -u rabbitmq-server -n 100 | grep -i "alarm\|block"
```
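
As a CLI alternative to step 1, the management plugin's HTTP API can list connection states directly (a sketch, assuming the default `guest`/`guest` credentials and `jq` installed; adjust the host and credentials for your deployment):

```bash
# List each connection and its state; look for "blocked" or "blocking"
curl -s -u guest:guest http://localhost:15672/api/connections | \
  jq -r '.[] | "\(.name)\t\(.state)"'
```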

**Resolution:**

1. Free up disk space on RabbitMQ server
2. Verify alarm cleared: `rabbitmqctl list_alarms`
3. Adjust disk limit if needed: `rabbitmqctl set_disk_free_limit 5GB`
4. Restart RabbitMQ: `systemctl restart rabbitmq-server`
5. Restart workers: `docker compose restart celeryworker`

**Prevention:**
- Monitor disk space on RabbitMQ server (alert at 80% usage; see the check script sketched after this list)
- Set reasonable disk free limit: `rabbitmqctl set_disk_free_limit 5GB`
- Configure log rotation for RabbitMQ logs
- Purge stale queues regularly (see below)
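
A minimal disk-usage check along these lines could feed such an alert (an illustrative sketch; the 80% threshold, the `/var/lib/rabbitmq` mount point, and GNU `df` are assumptions to adapt to your setup):

```bash
#!/usr/bin/env bash
# Warn when the filesystem backing RabbitMQ's data directory passes 80% usage
THRESHOLD=80
USAGE=$(df --output=pcent /var/lib/rabbitmq | tail -1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "WARNING: RabbitMQ disk usage at ${USAGE}%" >&2
fi
```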

### Stale worker queues breaking celery inspect

**Symptoms:**
- `celery inspect` times out even after fixing RabbitMQ issues
- Multiple `celery@<old-container-id>.celery.pidbox` queues in RabbitMQ

**Cause:**
Worker restarts create new pidbox control queues, but the old ones persist. `celery inspect` broadcasts to every pidbox queue and waits for replies, so it times out on queues left behind by dead workers.

**Resolution:**
1. Go to RabbitMQ Management UI → Queues
2. Delete old `celery@<old-container-id>.celery.pidbox` queues (or use the CLI sketch below)
3. Keep only current worker's pidbox queue
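
If you prefer the command line over the UI, the same cleanup can be done with `rabbitmqctl` (a sketch, assuming RabbitMQ 3.8+ where `delete_queue` is available; substitute the real queue names from the listing):

```bash
# List pidbox queues, then delete the ones belonging to dead workers
rabbitmqctl list_queues name | grep '\.celery\.pidbox'
rabbitmqctl delete_queue 'celery@<old-container-id>.celery.pidbox'
```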

**Alternative:** Target specific worker:
```bash
celery -A config.celery_app inspect stats -d celery@<current-worker-id>
```

### Worker keeps restarting every 100 tasks

**This is normal behavior** with `--max-tasks-per-child=100`.
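
For reference, the flag comes from the worker command line. A hedged sketch of how such a worker might be started (the app module `config.celery_app` is taken from the inspect example above; the other flags are illustrative assumptions):

```bash
# Each child process is recycled after 100 tasks to limit memory growth
celery -A config.celery_app worker --max-tasks-per-child=100 --loglevel=INFO
```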