fix: Verify and repair home directory ownership in user_manager (issue #9)#10
Open
fix: Verify and repair home directory ownership in user_manager (issue #9)#10
Conversation
Fixes permission denied errors when agents try to extract packages to /tmp/pixell_packages/. The issue occurred because: 1. The shared directory didn't exist or had restrictive permissions 2. Extracted package directories were owned by the wrong user Changes: - SupervisorState: Initialize /tmp/pixell_packages with 1777 permissions at startup, allowing all agent users to create subdirectories - ProcessManager: Fix ownership of existing extracted packages before spawning agents to ensure agent user can read them - PackageLoader: Improve error messaging for permission errors with actionable guidance Tests: - Added 7 new tests covering shared directory initialization and package ownership fixing - All 48 tests passing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit resolves three related issues with zombie process handling in the supervisor: Issue #4 - Zombie Process Reaping: - Added background task _reap_zombies_task() that runs every 5 seconds - Calls os.waitpid(-1, os.WNOHANG) to reap all zombie processes - Updates agent status to FAILED when zombie detected - Logs exit codes and signals for debugging Issue #5 - False Health Status: - Added is_process_zombie() using psutil for cross-platform detection - Added get_process_health() returning comprehensive process metrics - Updated is_running() to exclude zombies (returns False for zombies) - Updated GET /agents/:id endpoint to report real-time zombie status Issue #6 - Zombie Cleanup in DELETE/DEPLOY: - Added _cleanup_process_manager_state() helper method (idempotent) - Updated delete() to force-clean zombies from process_manager state - Updated deploy() to auto-detect and cleanup dead/zombie agents - Enables transparent recovery without PAC awareness Key Features: - Cross-platform zombie detection using psutil - Idempotent cleanup safe to call multiple times - Auto-recovery in DEPLOY operation - Comprehensive test coverage (35 tests total) Tests Added: - tests/test_supervisor_zombie_reaping.py (10 tests) - tests/test_supervisor_zombie_health.py (18 tests, 3 skipped) - tests/test_supervisor_zombie_cleanup.py (10 tests) - tests/test_supervisor_server.py (added 3 endpoint tests) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…ts (issue #7) This fixes the permission denied error when multiple agents try to create venvs in the shared /tmp/venvs directory. ## Problem - Supervisor uses PrivateTmp=true, creating isolated /tmp namespace - First agent creates /tmp/venvs/ owned by its user (agent_xxx) - Second agent fails to create venvs due to ownership mismatch - python3.11 -m venv fails with "Permission denied" ## Solution Move venvs from shared /tmp/venvs to each agent's home directory: - Venvs: $HOME/.pixell/venvs/ (e.g., /home/agent_xxx/.pixell/venvs/) - Pip cache: $HOME/.cache/pip/ (standard XDG location) ## Changes 1. runtime.py: - Use HOME env var for venvs directory - Pass agent_app_id to PackageLoader for venv isolation 2. loader.py: - Use HOME/.cache/pip for pip cache (XDG standard) - Prevents permission conflicts between agents 3. process_manager.py: - Extract hardcoded /tmp/pixell_packages path to variable ## Benefits ✅ Perfect isolation between agents ✅ No permission conflicts ✅ Follows XDG Base Directory spec ✅ Auto-cleanup when agent user deleted ✅ Works with systemd PrivateTmp ✅ Venvs survive supervisor restarts Closes #7 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…#9) ## Problem When agents are redeployed and the Linux user already exists, the supervisor doesn't verify that the home directory has correct ownership. This causes agents to fail when trying to create files in their home directory. Example error: [Errno 13] Permission denied: '/home/agent_8c82966883524dad_4906eeb7/.pixell' Root cause: - Legacy deployments created home directories owned by root:root - user_manager.create_user() returns early if user exists (line 114-116) - Never checks or repairs home directory ownership ## Solution ### Code Changes Modified src/pixell_runtime/supervisor/user_manager.py: - Added os import for stat() system call - Added ownership verification in create_user() when user exists - Auto-repair with chown -R and chmod 0700 if owned by root (UID 0) - Non-blocking error handling (logs but continues if repair fails) - Added support for short IDs in username generation (bonus feature) ### Testing Created tests/test_supervisor_user_manager.py: - 17 comprehensive unit tests covering: - Ownership already correct (no-op) - Ownership repaired successfully - Error handling when repair fails - Edge cases (home doesn't exist, stat fails, etc.) - Short IDs support - Logging verification - Idempotency - Timeout parameters All tests passing. ## Manual Fix Applied Also executed SSM command to fix vivid-commenter home directory immediately: sudo chown -R agent_8c82966883524dad_4906eeb7:agent_8c82966883524dad_4906eeb7 /home/agent_8c82966883524dad_4906eeb7 sudo chmod 0700 /home/agent_8c82966883524dad_4906eeb7 Verified: drwx------. 6 agent_xxx agent_xxx 60 /home/agent_8c82966883524dad_4906eeb7 Fixes #9 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
## Problem
The deploy script was creating `/etc/par-supervisor.conf` with `PORT=9000`,
but the supervisor code reads `SUPERVISOR_PORT` environment variable.
This caused the supervisor to default to port 9000 instead of 8080,
breaking ALB health checks and making agents appear offline.
## Root Cause
1. Config file set: `PORT=9000` (wrong variable name)
2. Supervisor reads: `os.getenv("SUPERVISOR_PORT", "9000")`
3. Since SUPERVISOR_PORT was not set, defaulted to 9000
4. ALB target group expects port 8080
## Solution
Modified scripts/deploy_ec2_par.sh to:
1. Use sed to replace PORT with SUPERVISOR_PORT in config file
2. Set value to 8080 (not 9000)
3. Updated all hardcoded port references from 9000 to 8080 in:
- Health check endpoints
- Example commands
- Documentation
## Changes
- Line 239: Added sed command to fix environment variable name
- Lines 337-343: Updated health check from port 9000 to 8080
- Line 356: Updated documentation port reference
- Lines 380-386: Updated example commands to use port 8080
- Line 394: Updated test agent deployment example
## Testing
Deployed to i-09dcb7f387166efd0 and verified:
- Config file now has: SUPERVISOR_PORT=8080
- Supervisor listening on: 0.0.0.0:8080
- Health endpoint responding: http://localhost:8080/health
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
9 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When agents are redeployed and the Linux user already exists, the supervisor doesn't verify that the home directory has correct ownership. This causes agents to fail when trying to create files in their home directory.
Example error:
Root cause:
root:rootuser_manager.create_user()returns early if user exists (line 114-116)Evidence:
Solution
Part 1 - Manual Fix (Applied ✅)
Executed SSM command to fix vivid-commenter home directory immediately:
Verified:
Part 2 - Code Fix (This PR)
Modified
src/pixell_runtime/supervisor/user_manager.py:osimport forstat()system callcreate_user()when user already existschown -Randchmod 0700if owned by root (UID 0)Key logic:
Part 3 - Testing
Created
tests/test_supervisor_user_manager.py:Test results:
Impact
Immediate:
Long-term:
Testing Checklist
Fixes #9
🤖 Generated with Claude Code