Skip to content

fix: Verify and repair home directory ownership in user_manager (issue #9)#10

Open
syumpx wants to merge 5 commits intomainfrom
fix/issue-9-home-directory-ownership
Open

fix: Verify and repair home directory ownership in user_manager (issue #9)#10
syumpx wants to merge 5 commits intomainfrom
fix/issue-9-home-directory-ownership

Conversation

@syumpx
Copy link
Copy Markdown
Member

@syumpx syumpx commented Oct 22, 2025

Problem

When agents are redeployed and the Linux user already exists, the supervisor doesn't verify that the home directory has correct ownership. This causes agents to fail when trying to create files in their home directory.

Example error:

[Errno 13] Permission denied: '/home/agent_8c82966883524dad_4906eeb7/.pixell'

Root cause:

  • Legacy deployments created home directories owned by root:root
  • user_manager.create_user() returns early if user exists (line 114-116)
  • Never checks or repairs home directory ownership

Evidence:

# Broken (vivid-commenter):
drwxr-xr-x. 6 root root 60 /home/agent_8c82966883524dad_4906eeb7

# Working (PAF-Core):
drwx------. 7 agent_xxx agent_xxx 131 /home/agent_8c82966883524dad_5pwbelmv

Solution

Part 1 - Manual Fix (Applied ✅)

Executed SSM command to fix vivid-commenter home directory immediately:

sudo chown -R agent_8c82966883524dad_4906eeb7:agent_8c82966883524dad_4906eeb7 /home/agent_8c82966883524dad_4906eeb7
sudo chmod 0700 /home/agent_8c82966883524dad_4906eeb7

Verified:

drwx------. 6 agent_8c82966883524dad_4906eeb7 agent_8c82966883524dad_4906eeb7 60 /home/agent_8c82966883524dad_4906eeb7

Part 2 - Code Fix (This PR)

Modified src/pixell_runtime/supervisor/user_manager.py:

  • Added os import for stat() system call
  • Added ownership verification in create_user() when user already exists
  • Auto-repair with chown -R and chmod 0700 if owned by root (UID 0)
  • Non-blocking error handling (logs but continues if repair fails)
  • Added support for short IDs in username generation (bonus feature)

Key logic:

if self.user_exists(agent_app_id):
    logger.info("User already exists", username=username)
    
    # NEW: Verify and repair home directory ownership
    if home_dir.exists():
        stat_info = home_dir.stat()
        if stat_info.st_uid == 0:  # Owned by root
            # Auto-repair with chown + chmod
            logger.warning("Home directory owned by root, repairing...")
            subprocess.run(["chown", "-R", f"{username}:{username}", str(home_dir)])
            subprocess.run(["chmod", "0700", str(home_dir)])
    
    return home_dir

Part 3 - Testing

Created tests/test_supervisor_user_manager.py:

  • 17 comprehensive unit tests covering:
    • ✅ Ownership already correct (no-op)
    • ✅ Ownership repaired successfully
    • ✅ Error handling when repair fails
    • ✅ Edge cases (home doesn't exist, stat fails, etc.)
    • ✅ Short IDs support
    • ✅ Logging verification
    • ✅ Idempotency
    • ✅ Timeout parameters (chown: 30s, chmod: 5s)

Test results:

17 passed in 0.18s

Impact

Immediate:

  • ✅ Fixes vivid-commenter agent (currently unavailable)

Long-term:

  • ✅ Self-healing for existing broken agents
  • ✅ Prevents future ownership issues from legacy deployments
  • ✅ Non-blocking error handling (won't break deployments)

Testing Checklist

  • Unit tests pass (17/17)
  • Manual SSM fix verified on EC2
  • Zombie process tests still pass (20/20)
  • Deploy to EC2 and verify auto-fix works
  • Redeploy vivid-commenter and verify it starts successfully

Fixes #9

🤖 Generated with Claude Code

syumpx and others added 4 commits October 20, 2025 10:58
Fixes permission denied errors when agents try to extract packages to
/tmp/pixell_packages/. The issue occurred because:
1. The shared directory didn't exist or had restrictive permissions
2. Extracted package directories were owned by the wrong user

Changes:
- SupervisorState: Initialize /tmp/pixell_packages with 1777 permissions
  at startup, allowing all agent users to create subdirectories
- ProcessManager: Fix ownership of existing extracted packages before
  spawning agents to ensure agent user can read them
- PackageLoader: Improve error messaging for permission errors with
  actionable guidance

Tests:
- Added 7 new tests covering shared directory initialization and
  package ownership fixing
- All 48 tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit resolves three related issues with zombie process handling
in the supervisor:

Issue #4 - Zombie Process Reaping:
- Added background task _reap_zombies_task() that runs every 5 seconds
- Calls os.waitpid(-1, os.WNOHANG) to reap all zombie processes
- Updates agent status to FAILED when zombie detected
- Logs exit codes and signals for debugging

Issue #5 - False Health Status:
- Added is_process_zombie() using psutil for cross-platform detection
- Added get_process_health() returning comprehensive process metrics
- Updated is_running() to exclude zombies (returns False for zombies)
- Updated GET /agents/:id endpoint to report real-time zombie status

Issue #6 - Zombie Cleanup in DELETE/DEPLOY:
- Added _cleanup_process_manager_state() helper method (idempotent)
- Updated delete() to force-clean zombies from process_manager state
- Updated deploy() to auto-detect and cleanup dead/zombie agents
- Enables transparent recovery without PAC awareness

Key Features:
- Cross-platform zombie detection using psutil
- Idempotent cleanup safe to call multiple times
- Auto-recovery in DEPLOY operation
- Comprehensive test coverage (35 tests total)

Tests Added:
- tests/test_supervisor_zombie_reaping.py (10 tests)
- tests/test_supervisor_zombie_health.py (18 tests, 3 skipped)
- tests/test_supervisor_zombie_cleanup.py (10 tests)
- tests/test_supervisor_server.py (added 3 endpoint tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ts (issue #7)

This fixes the permission denied error when multiple agents try to create
venvs in the shared /tmp/venvs directory.

## Problem

- Supervisor uses PrivateTmp=true, creating isolated /tmp namespace
- First agent creates /tmp/venvs/ owned by its user (agent_xxx)
- Second agent fails to create venvs due to ownership mismatch
- python3.11 -m venv fails with "Permission denied"

## Solution

Move venvs from shared /tmp/venvs to each agent's home directory:
- Venvs: $HOME/.pixell/venvs/ (e.g., /home/agent_xxx/.pixell/venvs/)
- Pip cache: $HOME/.cache/pip/ (standard XDG location)

## Changes

1. runtime.py:
   - Use HOME env var for venvs directory
   - Pass agent_app_id to PackageLoader for venv isolation

2. loader.py:
   - Use HOME/.cache/pip for pip cache (XDG standard)
   - Prevents permission conflicts between agents

3. process_manager.py:
   - Extract hardcoded /tmp/pixell_packages path to variable

## Benefits

✅ Perfect isolation between agents
✅ No permission conflicts
✅ Follows XDG Base Directory spec
✅ Auto-cleanup when agent user deleted
✅ Works with systemd PrivateTmp
✅ Venvs survive supervisor restarts

Closes #7

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…#9)

## Problem
When agents are redeployed and the Linux user already exists, the supervisor
doesn't verify that the home directory has correct ownership. This causes
agents to fail when trying to create files in their home directory.

Example error:
[Errno 13] Permission denied: '/home/agent_8c82966883524dad_4906eeb7/.pixell'

Root cause:
- Legacy deployments created home directories owned by root:root
- user_manager.create_user() returns early if user exists (line 114-116)
- Never checks or repairs home directory ownership

## Solution

### Code Changes
Modified src/pixell_runtime/supervisor/user_manager.py:
- Added os import for stat() system call
- Added ownership verification in create_user() when user exists
- Auto-repair with chown -R and chmod 0700 if owned by root (UID 0)
- Non-blocking error handling (logs but continues if repair fails)
- Added support for short IDs in username generation (bonus feature)

### Testing
Created tests/test_supervisor_user_manager.py:
- 17 comprehensive unit tests covering:
  - Ownership already correct (no-op)
  - Ownership repaired successfully
  - Error handling when repair fails
  - Edge cases (home doesn't exist, stat fails, etc.)
  - Short IDs support
  - Logging verification
  - Idempotency
  - Timeout parameters

All tests passing.

## Manual Fix Applied
Also executed SSM command to fix vivid-commenter home directory immediately:
sudo chown -R agent_8c82966883524dad_4906eeb7:agent_8c82966883524dad_4906eeb7 /home/agent_8c82966883524dad_4906eeb7
sudo chmod 0700 /home/agent_8c82966883524dad_4906eeb7

Verified: drwx------. 6 agent_xxx agent_xxx 60 /home/agent_8c82966883524dad_4906eeb7

Fixes #9

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
## Problem
The deploy script was creating `/etc/par-supervisor.conf` with `PORT=9000`,
but the supervisor code reads `SUPERVISOR_PORT` environment variable.
This caused the supervisor to default to port 9000 instead of 8080,
breaking ALB health checks and making agents appear offline.

## Root Cause
1. Config file set: `PORT=9000` (wrong variable name)
2. Supervisor reads: `os.getenv("SUPERVISOR_PORT", "9000")`
3. Since SUPERVISOR_PORT was not set, defaulted to 9000
4. ALB target group expects port 8080

## Solution
Modified scripts/deploy_ec2_par.sh to:
1. Use sed to replace PORT with SUPERVISOR_PORT in config file
2. Set value to 8080 (not 9000)
3. Updated all hardcoded port references from 9000 to 8080 in:
   - Health check endpoints
   - Example commands
   - Documentation

## Changes
- Line 239: Added sed command to fix environment variable name
- Lines 337-343: Updated health check from port 9000 to 8080
- Line 356: Updated documentation port reference
- Lines 380-386: Updated example commands to use port 8080
- Line 394: Updated test agent deployment example

## Testing
Deployed to i-09dcb7f387166efd0 and verified:
- Config file now has: SUPERVISOR_PORT=8080
- Supervisor listening on: 0.0.0.0:8080
- Health endpoint responding: http://localhost:8080/health

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: Verify and repair home directory ownership in user_manager

1 participant