feat: add auto-recovery monitor for self-healing task execution

ourines · claude · ourines · commit d0c8477d6e93 · 2025-12-20T14:29:39.000+08:00
- Add monitor.py: background process that detects errors (rate limit, API error, timeout) and auto-resumes
- Update launch.py: automatically starts monitor (use --no-monitor to disable)
- Update cleanup.py: kills monitor process on cleanup
- Update task-prompt-template.md: guide Claude to write checkpoints in .worktree-task/progress.json
- Fix on-session-start.py: handle installed_plugins.json array format
- Add CLAUDE.md for project documentation
- Bump version to 1.1.0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json
@@ -1,6 +1,6 @@
 {
   "name": "worktree-task",
-  "version": "1.0.0",
+  "version": "1.2.0",
   "description": "Manage large coding tasks using git worktrees and background Claude Code sessions. Supports launching, monitoring, resuming, and cleanup of autonomous tasks with alert notifications.",
   "author": {
     "name": "ourines"
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,83 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+A Claude Code plugin that enables spawning autonomous agent sessions in separate git worktrees via tmux. Manages large coding tasks that would otherwise overflow context by delegating phases to subagents.
+
+## Architecture
+
+```
+├── .claude-plugin/plugin.json    # Plugin manifest (name, version, repo)
+├── skills/worktree-task/SKILL.md # Auto-trigger skill definition
+├── commands/*.md                 # Slash command definitions (/worktree-task:*)
+├── hooks/
+│   ├── hooks.json                # Hook registrations (SessionStart, Stop, SessionEnd)
+│   └── handlers/*.py             # Hook implementations
+├── scripts/*.py                  # Core Python scripts for task management
+└── references/*.md               # Prompt templates for spawned agents
+```
+
+### Core Flow
+
+1. **launch.py** - Creates worktree + tmux session, launches agent (Claude Code or Codex), sends task prompt
+2. **status.py** - Captures tmux pane output, shows git info from worktree
+3. **resume.py** - Detects error type (rate_limit, api_error, timeout), sends recovery message
+4. **cleanup.py** - Kills tmux session, optionally removes worktree
+5. **merge.py / rebase.py** - Spawns Claude in tmux to auto-resolve git conflicts
+
+### Key Design Patterns
+
+- **tmux as state**: Sessions track running tasks, pane capture provides status
+- **Worktrees for isolation**: Each task gets its own working directory
+- **Template substitution**: `$TASK_DESCRIPTION`, `$WORKTREE_DIR` in `references/*.md`
+- **Hook-based updates**: `on-session-start.py` checks GitHub Releases API for updates (24h cache)
+
+## Common Commands
+
+```bash
+# Test a script directly
+python3 scripts/launch.py feature/test "Test task description"
+python3 scripts/status.py                    # List all sessions
+python3 scripts/status.py <session-name>     # Detailed session status
+python3 scripts/resume.py <session> --check  # Check without sending message
+python3 scripts/cleanup.py <session> --remove-worktree
+
+# Manual tmux operations
+tmux list-sessions
+tmux attach -t <session>
+tmux kill-session -t <session>
+
+# Git worktree operations
+git worktree list
+git worktree remove <path>
+```
+
+## Agent Command Options
+
+The plugin supports multiple agent backends via launch.py:
+
+| Flag | Command |
+|------|---------|
+| (default) | `claude --dangerously-skip-permissions` |
+| `--codex` | `codex --yolo -m gpt-5.1-codex-max -c model_reasoning_effort="high"` |
+| `--agent-cmd "<cmd>"` | Custom command |
+| `--env KEY=VALUE` | Add environment variables |
+
+## Conventions
+
+- Session names: `<project>-<branch>` with `/` and `.` replaced by `-`
+- Worktree paths: `../<project>-<branch-safe>` (parent directory)
+- Temp files: `/tmp/claude_task_prompt.txt`, `/tmp/claude_merge_prompt.txt`
+- Cache: `~/.claude/plugins/.worktree-task-update-cache.json`
+
+## Hook System
+
+Hooks are defined in `hooks/hooks.json` and triggered by Claude Code events:
+
+- **SessionStart** (startup): Checks for plugin updates via GitHub API
+- **Stop**: Task completion notification (macOS `osascript`)
+- **SessionEnd**: Session end handler
+
+Hook handlers receive JSON on stdin and output JSON with optional `systemMessage`.
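
Note on the Conventions section of CLAUDE.md: the naming rules can be sketched as a few pure functions. This is a minimal illustration, not code from the repository. The helper names (`branch_to_safe_name`, `session_name`, `worktree_path`) are hypothetical, and it assumes `<branch-safe>` uses the same `/` and `.` replacement that the session-name rule describes.

```python
from pathlib import Path


def branch_to_safe_name(branch: str) -> str:
    """Replace '/' and '.' with '-' as described for session names."""
    return branch.replace("/", "-").replace(".", "-")


def session_name(project: str, branch: str) -> str:
    """Build a tmux session name of the form <project>-<branch>."""
    return f"{project}-{branch_to_safe_name(branch)}"


def worktree_path(project_dir: Path, branch: str) -> Path:
    """Place the worktree next to the project directory (assumed layout)."""
    safe = branch_to_safe_name(branch)
    return project_dir.parent / f"{project_dir.name}-{safe}"
```

For a project directory `/home/u/myapp` and branch `feature/v1.2`, these helpers yield the session name `myapp-feature-v1-2` and a sibling worktree directory with the same name.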
diff --git a/hooks/handlers/on-session-start.py b/hooks/handlers/on-session-start.py
@@ -100,8 +100,19 @@ def get_installed_plugin_info() -> dict:
         with open(plugins_file, "r") as f:
             data = json.load(f)
         
-        plugin_info = data.get("plugins", {}).get(PLUGIN_ID, {})
-        if plugin_info:
+        plugin_list = data.get("plugins", {}).get(PLUGIN_ID, [])
+        # plugin_list is an array (may have multiple scopes: user, local, etc.)
+        # Prefer "user" scope, fallback to first entry
+        plugin_info = None
+        if isinstance(plugin_list, list) and plugin_list:
+            for entry in plugin_list:
+                if entry.get("scope") == "user":
+                    plugin_info = entry
+                    break
+            if not plugin_info:
+                plugin_info = plugin_list[0]
+
+        if plugin_info and isinstance(plugin_info, dict):
             result["version"] = plugin_info.get("version", "")
             result["gitCommitSha"] = plugin_info.get("gitCommitSha", "")
             result["installPath"] = plugin_info.get("installPath", "")
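
Note on the hunk above: the scope selection can be isolated into a small function that is easy to test. The sketch below is an illustration under assumptions, not the handler's actual code. The function name `pick_plugin_entry` and the sample data are hypothetical, and unlike the hunk it also skips non-dict entries before reading `scope`.

```python
from typing import Any, Optional


def pick_plugin_entry(entries: Any) -> Optional[dict]:
    """Prefer the entry with scope == "user", else fall back to the first dict entry."""
    if not isinstance(entries, list):
        return None
    # Ignore malformed entries so .get() is only called on dicts.
    candidates = [entry for entry in entries if isinstance(entry, dict)]
    for entry in candidates:
        if entry.get("scope") == "user":
            return entry
    return candidates[0] if candidates else None


# Hypothetical sample shaped like the array format the fix handles.
sample = [
    {"scope": "local", "version": "1.0.0"},
    {"scope": "user", "version": "1.1.0"},
]
chosen = pick_plugin_entry(sample)
```

Returning `None` for anything that is not a non-empty list keeps the caller's existing `if plugin_info` check meaningful.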
diff --git a/references/task-prompt-template.md b/references/task-prompt-template.md
@@ -44,6 +44,39 @@ Task tool:
 4. **Update TodoWrite** - Mark completed, add discovered tasks
 5. **Verify at end** - Build passes, tests pass
 
+## Checkpoint System
+
+A monitor process is watching this session and will auto-recover from errors (rate limits, API errors, etc.).
+To help recovery, maintain checkpoint state in `.worktree-task/progress.json`:
+
+### Update Checkpoint After Each Phase
+
+```python
+# Read current state
+import json
+from pathlib import Path
+
+progress_file = Path(".worktree-task/progress.json")
+progress = json.loads(progress_file.read_text()) if progress_file.exists() else {}
+
+# Update after completing a phase
+progress["current_phase"] = "phase-3"
+progress["completed_phases"] = ["phase-1", "phase-2", "phase-3"]
+progress["next_action"] = "Implement database models"
+progress["last_commit"] = "abc1234"
+
+progress_file.parent.mkdir(exist_ok=True)
+progress_file.write_text(json.dumps(progress, indent=2))
+```
+
+### Recovery Protocol
+
+If you see a message like "Continue from where you left off":
+1. Read `.worktree-task/progress.json` to understand current state
+2. Check your TodoWrite for remaining tasks
+3. Resume from `next_action` or the first incomplete todo
+4. Do NOT restart from the beginning
+
 ## Execution Mode
 
 - **SILENT MODE**: You are in a worktree, user has pre-approved ALL operations
@@ -59,18 +92,30 @@ Use TodoWrite throughout:
 - Mark `completed` immediately when done (never batch)
 - Only ONE task should be `in_progress` at a time
 
+## Completion Signal
+
+When ALL tasks are done, output exactly:
+
+```
+ALL TASKS COMPLETED
+```
+
+This signals the monitor to stop and send a completion notification.
+
 ## Example Workflow
 
 ```
 1. Read design docs / specs
 2. TodoWrite: Create 8 phase todos
-3. TodoWrite: Mark phase 1 as in_progress
-4. Task tool: Execute phase 1 (project setup)
-5. TodoWrite: Mark phase 1 completed, phase 2 in_progress
-6. Task tool: Execute phase 2 (database)
-7. ... continue for all phases ...
-8. Final verification
-9. Commit and report completion
+3. Update .worktree-task/progress.json with plan
+4. TodoWrite: Mark phase 1 as in_progress
+5. Task tool: Execute phase 1 (project setup)
+6. Update progress.json: append "phase-1" to completed_phases
+7. TodoWrite: Mark phase 1 completed, phase 2 in_progress
+8. Task tool: Execute phase 2 (database)
+9. ... continue for all phases ...
+10. Final verification (build, tests)
+11. Output: ALL TASKS COMPLETED
 ```
 
 ## Begin Now
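
Note on the Checkpoint System section of the template: since the monitor may resume a session that was interrupted mid-write, it can help to read the checkpoint defensively and write it atomically. The sketch below is one possible approach, not the plugin's implementation. The helper names (`load_progress`, `save_progress`, `mark_phase_done`) are hypothetical, and the field names are taken from the template's example.

```python
import json
import os
import tempfile
from pathlib import Path


def load_progress(path: Path) -> dict:
    """Return the checkpoint dict, or {} if the file is missing or unreadable."""
    try:
        data = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def save_progress(path: Path, progress: dict) -> None:
    """Write to a temp file in the same directory, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(progress, tmp, indent=2)
    os.replace(tmp_name, path)  # rename over the old file in one step


def mark_phase_done(path: Path, phase: str, next_action: str) -> dict:
    """Record a completed phase without duplicating it on repeated calls."""
    progress = load_progress(path)
    completed = progress.setdefault("completed_phases", [])
    if phase not in completed:
        completed.append(phase)
    progress["current_phase"] = phase
    progress["next_action"] = next_action
    save_progress(path, progress)
    return progress
```

A call such as `mark_phase_done(Path(".worktree-task/progress.json"), "phase-1", "Implement database models")` can be repeated safely after a recovery, because the phase is only appended once.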
diff --git a/scripts/cleanup.py b/scripts/cleanup.py
@@ -28,6 +28,18 @@ def get_git_root() -> Path:
     return Path(result.stdout.strip())
 
 
+def kill_monitor_process(session_name: str, worktree_dir: Path):
+    """Kill any monitor process associated with this session."""
+    # Find and kill monitor process by matching command line
+    result = run(f"pgrep -f 'monitor.py {session_name}'", check=False, capture=True)
+    if result.returncode == 0 and result.stdout.strip():
+        pids = result.stdout.strip().split('\n')
+        for pid in pids:
+            if pid:
+                run(f"kill {pid}", check=False)
+                print(f"  ✓ Killed monitor process (PID: {pid})")
+
+
 def main():
     if len(sys.argv) < 2:
         print("Usage: cleanup.py <session-name> [--remove-worktree]")
@@ -46,6 +58,16 @@ def main():
     print("=== Worktree Task Cleanup ===")
     print()
 
+    # Kill monitor process first
+    try:
+        project_dir = get_git_root()
+        project_name = project_dir.name
+        worktree_dir = project_dir.parent / f"{project_name}-{session_name}"
+        print("Killing monitor process...")
+        kill_monitor_process(session_name, worktree_dir)
+    except subprocess.CalledProcessError:
+        worktree_dir = None
+
     # Kill tmux session
     print(f"Killing tmux session: {session_name}")
     if session_exists(session_name):
@@ -55,14 +77,15 @@ def main():
         print("  ⚠ Session not found (may already be closed)")
     print()
 
-    # Find worktree path
-    try:
-        project_dir = get_git_root()
-        project_name = project_dir.name
-        worktree_dir = project_dir.parent / f"{project_name}-{session_name}"
-    except subprocess.CalledProcessError:
-        print("Warning: Not in a git repository, cannot determine worktree path")
-        worktree_dir = None
+    # Find worktree path (if not already set)
+    if worktree_dir is None:
+        try:
+            project_dir = get_git_root()
+            project_name = project_dir.name
+            worktree_dir = project_dir.parent / f"{project_name}-{session_name}"
+        except subprocess.CalledProcessError:
+            print("Warning: Not in a git repository, cannot determine worktree path")
+            worktree_dir = None
 
     # Show worktrees
     print("=== Git Worktrees ===")
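
Note on `kill_monitor_process`: `pgrep -f` treats its pattern as a regular expression matched anywhere in the command line, so a pattern ending in the session name would also match a session whose name merely starts with it (for example `proj-feat` and `proj-feat-2`). One way to avoid that is to compare whole arguments. The sketch below is a hypothetical helper, not part of the commit. It assumes the caller supplies process listing lines of the form `<pid> <command line>` (for example from `ps` output) and that the monitor is invoked as `monitor.py <session-name> <worktree-dir>`, as in launch.py.

```python
import shlex
from typing import Iterable, List


def find_monitor_pids(ps_lines: Iterable[str], session_name: str) -> List[int]:
    """Return PIDs whose argv contains monitor.py followed by exactly session_name."""
    pids = []
    for line in ps_lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # not a "<pid> <command line>" row
        pid, cmdline = parts
        try:
            argv = shlex.split(cmdline)
        except ValueError:
            continue  # unbalanced quotes in the listing, skip the row
        for index, arg in enumerate(argv[:-1]):
            if arg.endswith("monitor.py") and argv[index + 1] == session_name:
                pids.append(int(pid))
                break
    return pids


# Hypothetical listing: only PID 101 belongs to session "proj-feat".
listing = [
    "101 python3 /p/scripts/monitor.py proj-feat /w/proj-feat",
    "102 python3 /p/scripts/monitor.py proj-feat-2 /w/proj-feat-2",
    "103 vim monitor.py",
]
matching = find_monitor_pids(listing, "proj-feat")
```

The returned PIDs could then be passed to `os.kill` with `signal.SIGTERM` instead of building a `kill` shell command.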
diff --git a/scripts/launch.py b/scripts/launch.py
@@ -98,13 +98,34 @@ def load_task_template(script_dir: Path, task_desc: str, worktree_dir: str) -> s
     return template
 
 
+def start_monitor(session_name: str, worktree_dir: Path, script_dir: Path):
+    """Start the monitor process in the background."""
+    monitor_script = script_dir / "monitor.py"
+    log_file = worktree_dir / ".worktree-task" / "monitor.log"
+
+    # Ensure log directory exists
+    log_file.parent.mkdir(exist_ok=True)
+
+    # Start monitor as a background process
+    cmd = f"nohup python3 \"{monitor_script}\" {session_name} \"{worktree_dir}\" > \"{log_file}\" 2>&1 &"
+    subprocess.Popen(cmd, shell=True, start_new_session=True)
+
+
 def main():
     if len(sys.argv) < 3:
-        print("Usage: launch.py <branch-name> \"<task-description>\" [--env KEY=VALUE ...] [--agent-cmd \"<agent command>\"] [--claude] [--codex]")
-        print("Example: launch.py feature/my-task \"Implement the new feature\"")
-        print("Example: launch.py feature/my-task \"Task\" --env ANTHROPIC_BASE_URL=http://api.codex.markets")
-        print("Example: launch.py feature/my-task \"Task\" --agent-cmd \"codex --yolo -m gpt-5.1-codex-max -c model_reasoning_effort=\\\"high\\\"\"")
-        print("Example: launch.py feature/my-task \"Task\" --codex")
+        print("Usage: launch.py <branch-name> \"<task-description>\" [options]")
+        print()
+        print("Options:")
+        print("  --env KEY=VALUE      Set environment variable")
+        print("  --agent-cmd \"cmd\"    Custom agent command")
+        print("  --claude             Use Claude Code (default)")
+        print("  --codex              Use Codex CLI")
+        print("  --no-monitor         Disable auto-recovery monitor")
+        print()
+        print("Examples:")
+        print("  launch.py feature/my-task \"Implement the new feature\"")
+        print("  launch.py feature/my-task \"Task\" --codex")
+        print("  launch.py feature/my-task \"Task\" --no-monitor")
         sys.exit(1)
 
     branch_name = sys.argv[1]
@@ -114,6 +135,7 @@ def main():
     # Parse custom environment variables and agent command override
     custom_env = {}
     agent_cmd_override = None
+    enable_monitor = True  # Default: monitor enabled
     i = 3
     while i < len(sys.argv):
         if sys.argv[i] == "--env" and i + 1 < len(sys.argv):
@@ -131,6 +153,9 @@ def main():
         elif sys.argv[i] == "--claude":
             agent_cmd_override = DEFAULT_AGENT_CMD
             i += 1
+        elif sys.argv[i] == "--no-monitor":
+            enable_monitor = False
+            i += 1
         else:
             i += 1
 
@@ -235,10 +260,18 @@ def main():
     # Cleanup temp file
     temp_file.unlink()
 
+    # Start monitor process (default: enabled)
+    if enable_monitor:
+        print("Starting auto-recovery monitor...")
+        start_monitor(session_name, worktree_dir, script_dir)
+        print("  ✓ Monitor running in background")
+
     print()
     print("=== Task Launched Successfully ===")
     print()
-    print(f"Monitor:  {script_dir}/status.py {session_name}")
+    if enable_monitor:
+        print(f"Monitor:  tail -f \"{worktree_dir}/.worktree-task/monitor.log\"")
+    print(f"Status:   {script_dir}/status.py {session_name}")
     print(f"Attach:   tmux attach -t {session_name}")
     print(f"Kill:     tmux kill-session -t {session_name}")
     print(f"Cleanup:  {script_dir}/cleanup.py {session_name} --remove-worktree")
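
Note on `start_monitor`: the hunk builds a shell string and interpolates `session_name` without quoting. An alternative is to pass an argument list to `subprocess.Popen`, which avoids the shell entirely. The sketch below is an illustration under assumptions, not the commit's code. The function names are hypothetical, it uses `sys.executable` rather than `python3`, and it appends to the log where the original redirect truncates it.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_monitor_argv(monitor_script: Path, session_name: str, worktree_dir: Path) -> List[str]:
    """Argument list for the monitor; no shell quoting is needed."""
    return [sys.executable, str(monitor_script), session_name, str(worktree_dir)]


def start_monitor_detached(monitor_script: Path, session_name: str, worktree_dir: Path) -> subprocess.Popen:
    """Launch the monitor in its own session with output sent to monitor.log."""
    log_file = worktree_dir / ".worktree-task" / "monitor.log"
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with open(log_file, "ab") as log:
        # The child keeps its own copy of the log descriptor after this block exits.
        return subprocess.Popen(
            build_monitor_argv(monitor_script, session_name, worktree_dir),
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the launcher's session (POSIX)
        )
```

Because the process runs in a new session, it is not tied to the launching terminal, which is the role `nohup ... &` plays in the shell version.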
diff --git a/scripts/monitor.py b/scripts/monitor.py
