fix: Revert supervisor port to 9000 and standardize SUPERVISOR_PORT (issue #11) #12

Open

syumpx wants to merge 6 commits into main from fix/issue-11-supervisor-port-9000
Conversation

syumpx (Member) commented Oct 23, 2025

Problem

Supervisor port was incorrectly changed from 9000 to 8080, breaking all agent deployments.

Error from PAC:

{"error_message":"No instances with available capacity"}

Root Cause:

  1. Original EC2 config: PORT=9000 (wrong variable name, but worked because the code defaults to 9000)
  2. Recent "fix" in PR #10 ("fix: Verify and repair home directory ownership in user_manager (issue #9)"): changed to SUPERVISOR_PORT=8080 (correct variable, wrong port)
  3. Security group: Allows port 9000 from VPC, blocks 8080
  4. PAC: Reaches supervisor via private IP on port 9000 (not through ALB)
  5. Restart: Killed all running agents

Solution

Reverted everything to port 9000 and standardized on SUPERVISOR_PORT everywhere.
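The standardized lookup can be sketched as a single helper; the default mirrors the `os.getenv("SUPERVISOR_PORT", "9000")` call already in `__main__.py` (the wrapper function name here is illustrative):

```python
import os

def resolve_supervisor_port() -> int:
    """Read SUPERVISOR_PORT, falling back to the canonical default of 9000."""
    return int(os.getenv("SUPERVISOR_PORT", "9000"))
```

Because every component resolves the port through the same variable and default, a missing or misspelled entry (like the old `PORT=9000`) now degrades to the correct port instead of a mismatched one.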

Changes Made

EC2 Configuration (Applied via SSM) ✅

# Fixed config file
SUPERVISOR_PORT=9000  # was 8080

# Restarted service
sudo systemctl restart par-supervisor

# Verified
curl http://localhost:9000/health
# {"status":"healthy","available":200}
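The same health check can be scripted from Python instead of curl; a minimal sketch assuming only the `/health` endpoint and JSON body shown above:

```python
import json
import urllib.request

def check_supervisor_health(host: str = "localhost", port: int = 9000) -> dict:
    """GET the supervisor /health endpoint and return the parsed JSON body."""
    url = f"http://{host}:{port}/health"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())

# e.g. check_supervisor_health().get("status") should be "healthy"
```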

Deploy Script (scripts/deploy_ec2_par.sh)

  • Line 239: Set SUPERVISOR_PORT=9000 (was 8080)
  • Lines 337-343: Health check URLs → :9000
  • Line 356: Documentation → Port: 9000
  • Lines 380, 383, 386: Example commands → :9000
  • Line 394: Test deployment → :9000

Verification

All files now consistent on port 9000:

  • src/pixell_runtime/supervisor/__main__.py: os.getenv("SUPERVISOR_PORT", "9000")
  • docs/SUPERVISOR_README.md: Documents port 9000
  • systemd/pixell-supervisor.service: Environment="SUPERVISOR_PORT=9000"
  • scripts/deploy_ec2_par.sh: Uses port 9000 everywhere

Testing

Completed:

  • EC2 config updated: SUPERVISOR_PORT=9000
  • Supervisor restarted successfully
  • Service active: systemctl is-active par-supervisor → active
  • Listening on port 9000: ss -tlnp | grep 9000 → ✅
  • Health check: curl http://localhost:9000/health → {"status":"healthy"}
  • Capacity available: 200 slots

Next Steps:

  • Merge this PR
  • Redeploy agents (PAF-Core, vivid-commenter) via PAC
  • Verify agents deploy successfully

Impact

  • ✅ Fixes broken agent deployments
  • ✅ Restores supervisor to working state
  • ✅ Standardizes port configuration across codebase
  • ✅ Agents can now be deployed again

Fixes #11

🤖 Generated with Claude Code

syumpx and others added 6 commits October 20, 2025 10:58
Fixes permission denied errors when agents try to extract packages to
/tmp/pixell_packages/. The issue occurred because:
1. The shared directory didn't exist or had restrictive permissions
2. Extracted package directories were owned by the wrong user

Changes:
- SupervisorState: Initialize /tmp/pixell_packages with 1777 permissions
  at startup, allowing all agent users to create subdirectories
- ProcessManager: Fix ownership of existing extracted packages before
  spawning agents to ensure agent user can read them
- PackageLoader: Improve error messaging for permission errors with
  actionable guidance
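The shared-directory initialization described above can be sketched as follows (the path matches the commit; the function name is illustrative, not the actual SupervisorState API):

```python
import os
from pathlib import Path

def init_shared_package_dir(path: str = "/tmp/pixell_packages") -> Path:
    """Create the shared extraction directory with mode 1777 (world-writable
    plus sticky bit), so every agent user can create subdirectories but can
    only delete its own entries."""
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    # chmod explicitly: mkdir's mode argument is masked by the process umask,
    # so relying on it would silently drop the 1777 bits.
    os.chmod(p, 0o1777)
    return p
```

The sticky bit is the same mechanism `/tmp` itself uses, which is why 1777 is the natural mode for a multi-user scratch directory.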

Tests:
- Added 7 new tests covering shared directory initialization and
  package ownership fixing
- All 48 tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit resolves three related issues with zombie process handling
in the supervisor:

Issue #4 - Zombie Process Reaping:
- Added background task _reap_zombies_task() that runs every 5 seconds
- Calls os.waitpid(-1, os.WNOHANG) to reap all zombie processes
- Updates agent status to FAILED when zombie detected
- Logs exit codes and signals for debugging
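One pass of the non-blocking reaping loop described above might look like this (a sketch of what `_reap_zombies_task()` runs each cycle; the standalone function name is assumed):

```python
import os

def reap_zombies_once() -> list[tuple[int, int]]:
    """Reap all currently-exited children without blocking.

    Returns a list of (pid, wait_status) pairs; an empty list means no
    zombies were pending this cycle."""
    reaped = []
    while True:
        try:
            # WNOHANG makes waitpid return immediately instead of blocking.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped.append((pid, status))
    return reaped
```

The returned wait statuses can then be decoded with `os.waitstatus_to_exitcode()` for the exit-code/signal logging the commit mentions.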

Issue #5 - False Health Status:
- Added is_process_zombie() using psutil for cross-platform detection
- Added get_process_health() returning comprehensive process metrics
- Updated is_running() to exclude zombies (returns False for zombies)
- Updated GET /agents/:id endpoint to report real-time zombie status
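The commit uses psutil's `STATUS_ZOMBIE` for cross-platform detection; a Linux-only stdlib sketch of the same check, reading the process state field from `/proc/<pid>/stat`, looks like this:

```python
def is_process_zombie(pid: int) -> bool:
    """Linux-only: return True if the process is a zombie.

    The state is the first field after the closing ')' of the comm field
    in /proc/<pid>/stat; 'Z' means zombie. (The actual implementation uses
    psutil for portability.)"""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except FileNotFoundError:
        return False  # process does not exist at all
    # comm may itself contain spaces or parens, so split after the LAST ')'.
    return data.rpartition(")")[2].split()[0] == "Z"
```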

Issue #6 - Zombie Cleanup in DELETE/DEPLOY:
- Added _cleanup_process_manager_state() helper method (idempotent)
- Updated delete() to force-clean zombies from process_manager state
- Updated deploy() to auto-detect and cleanup dead/zombie agents
- Enables transparent recovery without PAC awareness

Key Features:
- Cross-platform zombie detection using psutil
- Idempotent cleanup safe to call multiple times
- Auto-recovery in DEPLOY operation
- Comprehensive test coverage (35 tests total)

Tests Added:
- tests/test_supervisor_zombie_reaping.py (10 tests)
- tests/test_supervisor_zombie_health.py (18 tests, 3 skipped)
- tests/test_supervisor_zombie_cleanup.py (10 tests)
- tests/test_supervisor_server.py (added 3 endpoint tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ts (issue #7)

This fixes the permission denied error when multiple agents try to create
venvs in the shared /tmp/venvs directory.

## Problem

- Supervisor uses PrivateTmp=true, creating isolated /tmp namespace
- First agent creates /tmp/venvs/ owned by its user (agent_xxx)
- Second agent fails to create venvs due to ownership mismatch
- python3.11 -m venv fails with "Permission denied"

## Solution

Move venvs from shared /tmp/venvs to each agent's home directory:
- Venvs: $HOME/.pixell/venvs/ (e.g., /home/agent_xxx/.pixell/venvs/)
- Pip cache: $HOME/.cache/pip/ (standard XDG location)
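The per-agent path resolution above can be sketched as (one venv subdirectory per `agent_app_id` is an assumption based on the isolation goal; the function name is illustrative):

```python
import os
from pathlib import Path

def venv_dir_for_agent(agent_app_id: str) -> Path:
    """Resolve a per-agent venv directory under the agent's own home
    ($HOME/.pixell/venvs/<agent_app_id>) instead of the shared /tmp/venvs."""
    home = Path(os.environ.get("HOME", os.path.expanduser("~")))
    return home / ".pixell" / "venvs" / agent_app_id
```

Because each supervisor-spawned agent runs with its own user and HOME, two agents can never collide on the same venv path, which is what eliminates the ownership race.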

## Changes

1. runtime.py:
   - Use HOME env var for venvs directory
   - Pass agent_app_id to PackageLoader for venv isolation

2. loader.py:
   - Use HOME/.cache/pip for pip cache (XDG standard)
   - Prevents permission conflicts between agents

3. process_manager.py:
   - Extract hardcoded /tmp/pixell_packages path to variable

## Benefits

✅ Perfect isolation between agents
✅ No permission conflicts
✅ Follows XDG Base Directory spec
✅ Auto-cleanup when agent user deleted
✅ Works with systemd PrivateTmp
✅ Venvs survive supervisor restarts

Closes #7

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…#9)

## Problem
When agents are redeployed and the Linux user already exists, the supervisor
doesn't verify that the home directory has correct ownership. This causes
agents to fail when trying to create files in their home directory.

Example error:
[Errno 13] Permission denied: '/home/agent_8c82966883524dad_4906eeb7/.pixell'

Root cause:
- Legacy deployments created home directories owned by root:root
- user_manager.create_user() returns early if user exists (line 114-116)
- Never checks or repairs home directory ownership

## Solution

### Code Changes
Modified src/pixell_runtime/supervisor/user_manager.py:
- Added os import for stat() system call
- Added ownership verification in create_user() when user exists
- Auto-repair with chown -R and chmod 0700 if owned by root (UID 0)
- Non-blocking error handling (logs but continues if repair fails)
- Added support for short IDs in username generation (bonus feature)
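The verify-and-repair step can be sketched as follows; this is a simplified standalone version under the commit's stated behavior (repair only when owned by UID 0, never raise), not the actual `user_manager.create_user()` code:

```python
import os
import subprocess

def ensure_home_ownership(home: str, username: str) -> None:
    """If an existing home directory is owned by root (UID 0), chown it back
    to the agent user and tighten permissions to 0700. Failures are logged
    rather than raised, matching the non-blocking behavior described above."""
    try:
        if not os.path.isdir(home):
            return
        if os.stat(home).st_uid != 0:
            return  # owned by a non-root user already; nothing to repair
        subprocess.run(
            ["chown", "-R", f"{username}:{username}", home],
            check=True, timeout=30,
        )
        os.chmod(home, 0o700)
    except (OSError, subprocess.SubprocessError) as exc:
        print(f"home ownership repair failed for {home}: {exc}")
```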

### Testing
Created tests/test_supervisor_user_manager.py:
- 17 comprehensive unit tests covering:
  - Ownership already correct (no-op)
  - Ownership repaired successfully
  - Error handling when repair fails
  - Edge cases (home doesn't exist, stat fails, etc.)
  - Short IDs support
  - Logging verification
  - Idempotency
  - Timeout parameters

All tests passing.

## Manual Fix Applied
Also executed SSM command to fix vivid-commenter home directory immediately:
sudo chown -R agent_8c82966883524dad_4906eeb7:agent_8c82966883524dad_4906eeb7 /home/agent_8c82966883524dad_4906eeb7
sudo chmod 0700 /home/agent_8c82966883524dad_4906eeb7

Verified: drwx------. 6 agent_xxx agent_xxx 60 /home/agent_8c82966883524dad_4906eeb7

Fixes #9

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
## Problem
The deploy script was creating `/etc/par-supervisor.conf` with `PORT=9000`,
but the supervisor code reads `SUPERVISOR_PORT` environment variable.
This caused the supervisor to default to port 9000 instead of 8080,
breaking ALB health checks and making agents appear offline.

## Root Cause
1. Config file set: `PORT=9000` (wrong variable name)
2. Supervisor reads: `os.getenv("SUPERVISOR_PORT", "9000")`
3. Since SUPERVISOR_PORT was not set, defaulted to 9000
4. ALB target group expects port 8080

## Solution
Modified scripts/deploy_ec2_par.sh to:
1. Use sed to replace PORT with SUPERVISOR_PORT in config file
2. Set value to 8080 (not 9000)
3. Updated all hardcoded port references from 9000 to 8080 in:
   - Health check endpoints
   - Example commands
   - Documentation

## Changes
- Line 239: Added sed command to fix environment variable name
- Lines 337-343: Updated health check from port 9000 to 8080
- Line 356: Updated documentation port reference
- Lines 380-386: Updated example commands to use port 8080
- Line 394: Updated test agent deployment example

## Testing
Deployed to i-09dcb7f387166efd0 and verified:
- Config file now has: SUPERVISOR_PORT=8080
- Supervisor listening on: 0.0.0.0:8080
- Health endpoint responding: http://localhost:8080/health

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…issue #11)

## Problem
Supervisor port was incorrectly changed from 9000 to 8080, breaking agent deployments.

**Error:**
```
{"error_message":"No instances with available capacity"}
```

**Root Cause:**
1. Original config: `PORT=9000` (wrong variable name, but worked via default)
2. Recent "fix": Changed to `SUPERVISOR_PORT=8080` (right variable, wrong port)
3. Security group: Allows port 9000, blocks 8080
4. PAC: Reaches supervisor via private IP on port 9000 (not ALB)
5. Restart: Killed all running agents

## Solution
Reverted to port 9000 and standardized on `SUPERVISOR_PORT` everywhere.

### Changes Made

**EC2 Configuration (via SSM):**
- Fixed `/etc/par-supervisor.conf`: `SUPERVISOR_PORT=9000`
- Restarted supervisor service
- Verified: Listening on port 9000 ✅

**Deploy Script (scripts/deploy_ec2_par.sh):**
- Line 239: Set `SUPERVISOR_PORT=9000` (was 8080)
- Line 240: Updated echo message
- Lines 337, 340, 343: Health check URLs → :9000
- Line 356: Documentation → Port 9000
- Lines 380, 383, 386: Example commands → :9000
- Line 394: Test deployment → :9000

### Consistency Verified
All files now consistent with port 9000:
- ✅ `src/pixell_runtime/supervisor/__main__.py`: Defaults to 9000
- ✅ `docs/SUPERVISOR_README.md`: States port 9000
- ✅ `systemd/pixell-supervisor.service`: Has SUPERVISOR_PORT=9000
- ✅ `scripts/deploy_ec2_par.sh`: Uses port 9000 everywhere

## Testing
- [x] EC2 config updated to SUPERVISOR_PORT=9000
- [x] Supervisor restarted successfully
- [x] Health check: `curl http://localhost:9000/health` ✅
- [x] Status: `{"status":"healthy","available":200}` ✅
- [ ] Agents can now be redeployed (next step for PAC)

Fixes #11

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>