diff --git a/.claude/PRPs/issues/completed/issue-306.md b/.claude/PRPs/issues/completed/issue-306.md new file mode 100644 index 0000000..9c7b927 --- /dev/null +++ b/.claude/PRPs/issues/completed/issue-306.md @@ -0,0 +1,454 @@ +# Investigation: Add terminate button for running workflows on tasks page + +**Issue**: #306 (https://github.com/tbrandenburg/made/issues/306) +**Type**: ENHANCEMENT +**Investigated**: 2026-03-16T12:45:00Z + +### Assessment + +| Metric | Value | Reasoning | +| ---------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Priority | MEDIUM | Issue explicitly states "Medium - quality-of-life improvement that makes existing admin functionality accessible to users through the UI" | +| Complexity | MEDIUM | Found 3 files to modify (backend API, frontend API hook, frontend UI), follows existing patterns for cancel buttons (RepositoryPage:1630, TaskPage:490), requires coordination between backend and frontend | +| Confidence | HIGH | `force_terminate_job()` function already exists and is tested at cron_service.py:436, found clear UI patterns to mirror, API endpoint patterns are well-established, workflow ID format is consistent throughout codebase | + +--- + +## Problem Statement + +Users cannot manually stop long-running or stuck workflows from the UI. The backend already has `force_terminate_job()` function (added in recent cron improvements #297), but there's no API endpoint or UI button to access this functionality. + +--- + +## Analysis + +### Change Rationale + +The cron service improvements added `force_terminate_job()` in `cron_service.py:436-443` as an admin function. However: +1. No API endpoint exists to expose this function to frontend +2. No UI exists in TasksPage to trigger termination for running workflows + +This enhancement follows existing patterns: +- Backend: Similar cancel endpoints in `app.py` (lines 435, 871, 1127) +- Frontend: Danger buttons with confirmation in TaskPage, RepositoryPage + +### Evidence Chain + +ENHANCEMENT: Add UI and API access to existing force_terminate_job() +↓ BECAUSE: Backend function exists but not exposed +Evidence: `cron_service.py:436-443` - `force_terminate_job(workflow_id: str) -> bool` + +↓ BECAUSE: No API endpoint exists +Evidence: `app.py` has cancel endpoints for tasks/repos but none for workflows + +↓ BECAUSE: No UI button exists +Evidence: `TasksPage.tsx:180-226` - workflow table has no action buttons, only displays diagnostics + +↓ SOLUTION: Add API endpoint + UI button following existing patterns +Evidence: Three similar cancel endpoints exist (app.py:435, 871, 1127) + danger button patterns throughout UI + +### Affected Files + +| File | Lines | Action | Description | +| -------------------------------------------- | --------- | ------ | ---------------------------------------- | +| `packages/pybackend/app.py` | ~630 | CREATE | Add POST /api/workflows/{workflow_id}/terminate endpoint | +| `packages/frontend/src/hooks/useApi.ts` | ~481 | UPDATE | Add terminateWorkflow API method | +| `packages/frontend/src/pages/TasksPage.tsx` | ~200,~240 | UPDATE | Add terminate button and handler logic | + +### Integration Points + +- New API endpoint calls `force_terminate_job(workflow_id)` from cron_service +- Frontend button calls new API method on click with confirmation +- UI uses `workflow.diagnostics.running` to show/hide terminate button +- Frontend refreshes workflow list after termination via existing `getWorkspaceWorkflows()` + +### Git History + +- **force_terminate_job introduced**: 6ba299e - "Fix: Improve cron service code quality with thread safety, error handling, and runtime limits (#297)" +- **TasksPage last modified**: f5e91b4 - "Clarify workflow diagnostics stderr and stdout tail" +- **Implication**: Recent addition of function, stable TasksPage, no conflicts expected + +--- + +## Implementation Plan + +### Step 1: Add terminate API endpoint + +**File**: `packages/pybackend/app.py` +**Lines**: After line 630 (after existing cron endpoints) +**Action**: CREATE + +**Required change:** + +```python +@app.post("/api/workflows/{workflow_id}/terminate") +def terminate_workflow(workflow_id: str): + """Terminate a running workflow job.""" + try: + success = force_terminate_job(workflow_id) + if not success: + raise HTTPException( + status_code=status.HTTP_404_NOT_FOUND, + detail="No active workflow process to terminate", + ) + return {"success": True, "message": "Workflow terminated successfully"} + except Exception as e: + logger.exception(f"Error terminating workflow {workflow_id}") + raise HTTPException( + status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, + detail=f"Failed to terminate workflow: {str(e)}", + ) +``` + +**Why**: Exposes existing force_terminate_job function via REST API following established pattern + +--- + +### Step 2: Add force_terminate_job import + +**File**: `packages/pybackend/app.py` +**Lines**: Around line 54-60 (with other cron_service imports) +**Action**: UPDATE + +**Current code:** + +```python +from cron_service import ( + get_cron_job_diagnostics, + get_cron_job_last_runs, + refresh_cron_clock, + start_cron_clock, + stop_cron_clock, +) +``` + +**Required change:** + +```python +from cron_service import ( + force_terminate_job, + get_cron_job_diagnostics, + get_cron_job_last_runs, + refresh_cron_clock, + start_cron_clock, + stop_cron_clock, +) +``` + +**Why**: Need to import the function to use it in the new endpoint + +--- + +### Step 3: Add terminateWorkflow API method + +**File**: `packages/frontend/src/hooks/useApi.ts` +**Lines**: After line 480 (after getWorkspaceWorkflows) +**Action**: UPDATE + +**Current code:** + +```typescript +getWorkspaceWorkflows: () => + request<{ workflows: WorkspaceWorkflowSummary[] }>("/workspace/workflows"), +``` + +**Required change:** + +```typescript +getWorkspaceWorkflows: () => + request<{ workflows: WorkspaceWorkflowSummary[] }>("/workspace/workflows"), +terminateWorkflow: (workflowId: string) => + request<{ success: boolean; message?: string }>(`/workflows/${encodeURIComponent(workflowId)}/terminate`, { + method: "POST", + }), +``` + +**Why**: Frontend needs API method to call the new backend endpoint + +--- + +### Step 4: Add terminate button state management + +**File**: `packages/frontend/src/pages/TasksPage.tsx` +**Lines**: After line 45 (after existing state declarations) +**Action**: UPDATE + +**Current code:** + +```tsx +const [createOpen, setCreateOpen] = useState(false); +``` + +**Required change:** + +```tsx +const [createOpen, setCreateOpen] = useState(false); +const [terminatingWorkflow, setTerminatingWorkflow] = useState(null); +const [terminateModal, setTerminateModal] = useState(false); +const [selectedWorkflow, setSelectedWorkflow] = useState(null); +``` + +**Why**: Need state to track termination in progress and confirmation modal + +--- + +### Step 5: Add terminate handler functions + +**File**: `packages/frontend/src/pages/TasksPage.tsx` +**Lines**: After line 165 (after existing handlers) +**Action**: UPDATE + +**Required change:** + +```tsx +const handleTerminate = (workflow: WorkspaceWorkflowSummary) => { + setSelectedWorkflow(workflow); + setTerminateModal(true); +}; + +const handleConfirmTerminate = async () => { + if (!selectedWorkflow) return; + + const workflowId = `${selectedWorkflow.repository}:${selectedWorkflow.id}`; + setTerminatingWorkflow(workflowId); + setTerminateModal(false); + + try { + await api.terminateWorkflow(workflowId); + await refreshWorkspaceWorkflows(); + } catch (error) { + console.error("Failed to terminate workflow:", error); + alert(`Failed to terminate workflow: ${error instanceof Error ? error.message : "Unknown error"}`); + } finally { + setTerminatingWorkflow(null); + setSelectedWorkflow(null); + } +}; +``` + +**Why**: Handle button click and confirmation with proper error handling and loading states + +--- + +### Step 6: Update table header for Actions column + +**File**: `packages/frontend/src/pages/TasksPage.tsx` +**Lines**: Line 182-189 (table headers) +**Action**: UPDATE + +**Current code:** + +```tsx +Enabled +Schedule +Name +Repository +Last run +Diagnostics +``` + +**Required change:** + +```tsx +Enabled +Schedule +Name +Repository +Last run +Diagnostics +Actions +``` + +**Why**: Need column header for the new terminate button + +--- + +### Step 7: Add terminate button to workflow table + +**File**: `packages/frontend/src/pages/TasksPage.tsx` +**Lines**: Line 215-220 (after diagnostics cell) +**Action**: UPDATE + +**Current code:** + +```tsx +{formatWorkflowLastRun(workflow)} + + {renderWorkflowDiagnosticsSummary(workflow.diagnostics)} + + +``` + +**Required change:** + +```tsx +{formatWorkflowLastRun(workflow)} + + {renderWorkflowDiagnosticsSummary(workflow.diagnostics)} + + + {workflow.diagnostics?.running && ( + + )} + + +``` + +**Why**: Adds terminate button that only shows for running workflows with loading state + +--- + +### Step 8: Add confirmation modal + +**File**: `packages/frontend/src/pages/TasksPage.tsx` +**Lines**: After line 227 (before closing Panel) +**Action**: UPDATE + +**Current code:** + +```tsx + + +``` + +**Required change:** + +```tsx + + + setTerminateModal(false)} + > +

Are you sure you want to terminate this job?

+ {selectedWorkflow && ( +

Workflow: {selectedWorkflow.name}

+ )} +
+ + +
+
+ +``` + +**Why**: Provides confirmation dialog to prevent accidental termination + +--- + +## Patterns to Follow + +### Backend API endpoint pattern from app.py:1127-1134 + +```python +@app.post("/api/tasks/{name}/agent/cancel") +def task_agent_cancel(name: str): + if not cancel_agent_message(f"task:{name}"): + raise HTTPException( + status_code=status.HTTP_404_NOT_FOUND, + detail="No active agent process to cancel", + ) + return {"success": True} +``` + +### Frontend danger button pattern from TaskPage.tsx:490 + +```tsx + +``` + +### Modal confirmation pattern from existing components + +```tsx + +

Confirmation message

+
+ + +
+
+``` + +--- + +## Edge Cases & Risks + +| Risk/Edge Case | Mitigation | +| ------------------------------------- | -------------------------------------------------------------------- | +| Workflow completes before terminate | force_terminate_job returns false, API returns 404 with clear message | +| Network failure during request | Frontend catches error, shows alert, resets state in finally block | +| Double-click on terminate button | Button disabled via terminatingWorkflow state during request | +| Workflow ID encoding issues | Use encodeURIComponent() in API URL construction | +| User cancels confirmation dialog | Modal state properly reset, no API call made | + +--- + +## Validation + +### Automated Checks + +```bash +# Backend type check and import validation +cd packages/pybackend && uv run mypy app.py + +# Backend tests - test existing force_terminate_job function +cd packages/pybackend && uv run python -m pytest tests/unit/test_cron_service.py::test_force_terminate_job -v + +# Frontend type check +cd packages/frontend && npm run type-check + +# Frontend tests for TasksPage +cd packages/frontend && npm run test -- TasksPage --run +``` + +### Manual Verification + +1. Start both servers: `make run` +2. Navigate to Tasks page at `http://localhost:5173/tasks` +3. Verify terminate button appears only next to workflows with `diagnostics.running = true` +4. Click terminate button and verify confirmation dialog appears +5. Cancel dialog and verify no API call made +6. Confirm termination and verify workflow stops running +7. Verify button shows "Terminating..." state during request +8. Test error case with non-existent workflow ID + +--- + +## Scope Boundaries + +**IN SCOPE:** + +- Add terminate button to TasksPage for running workflows only +- Add backend API endpoint calling existing force_terminate_job() +- Add confirmation dialog with clear messaging +- Handle success/error feedback appropriately + +**OUT OF SCOPE (do not touch):** + +- Changes to existing force_terminate_job() function behavior +- Real-time polling or WebSocket updates (use existing refresh pattern) +- Permission/authorization system (no auth requirements in issue) +- Bulk terminate operations for multiple workflows +- Terminate buttons in other parts of UI (only TasksPage requested) + +--- + +## Metadata + +- **Investigated by**: Claude +- **Timestamp**: 2026-03-16T12:45:00Z +- **Artifact**: `.claude/PRPs/issues/issue-306.md` \ No newline at end of file diff --git a/.claude/PRPs/reviews/pr-305-review.md b/.claude/PRPs/reviews/pr-305-review.md new file mode 100644 index 0000000..e91922a --- /dev/null +++ b/.claude/PRPs/reviews/pr-305-review.md @@ -0,0 +1,147 @@ +--- +pr: 305 +title: "Fix: Improve cron service code quality with thread safety, error handling, and runtime limits (#297)" +author: "tbrandenburg" +reviewed: 2026-03-16T07:15:00Z +recommendation: approve +--- + +# PR Review: #305 - Fix: Improve cron service code quality with thread safety, error handling, and runtime limits (#297) + +**Author**: @tbrandenburg +**Branch**: fix/issue-297-cron-service-improvements -> main +**Files Changed**: 4 (+296/-23) + +--- + +## Summary + +Excellent implementation that comprehensively addresses thread safety vulnerabilities, error handling gaps, and runtime limit concerns in the cron service. This PR transforms a basic job scheduler into a production-grade service with robust monitoring, timeout handling, and administrative capabilities. All changes are well-tested, follow established patterns, and maintain backward compatibility. + +--- + +## Implementation Context + +| Artifact | Path | +|----------|------| +| Implementation Report | `.claude/PRPs/issues/completed/issue-297.md` | +| Original Plan | GitHub issue #297 comment | +| Documented Deviations | **0** - Implementation follows plan exactly | + +**Implementation Quality**: The implementation report shows all 8 planned steps were completed successfully with no deviations from the original plan. This indicates excellent planning and execution discipline. + +--- + +## Changes Overview + +| File | Changes | Assessment | +|------|---------|------------| +| `packages/pybackend/cron_service.py` | +124/-10 | **EXCELLENT** - Thread safety, timeout monitoring, admin functions | +| `packages/pybackend/workflow_service.py` | +5/-0 | **PASS** - Clean schema extension for runtime limits | +| `packages/pybackend/tests/unit/test_cron_service.py` | +114/-13 | **EXCELLENT** - Comprehensive test coverage for new features | +| `.claude/PRPs/issues/completed/issue-297.md` | +53/-0 | **PASS** - Implementation tracking artifact | + +--- + +## Issues Found + +### Critical +**No critical issues found.** + +### High Priority +**No high priority issues found.** + +### Medium Priority +**No medium priority issues found.** + +### Suggestions + +- **`cron_service.py:276`** - Consider making timeout monitor interval configurable + - **Why**: Currently hardcoded to 1 minute, may want different intervals for different deployments + - **Fix**: Add `TIMEOUT_MONITOR_INTERVAL_MINUTES` configuration option + +- **`cron_service.py:436-464`** - Add docstrings for new admin functions + - **Why**: New public functions lack documentation about parameters and return values + - **Fix**: Add JSDoc-style docstrings for `force_terminate_job()` and `get_long_running_jobs()` + +- **`cron_service.py:95`** - Consider adding metrics for timeout terminations + - **Why**: Production monitoring would benefit from timeout event counting + - **Fix**: Add `_timeout_terminated_jobs` counter similar to existing counters + +--- + +## Validation Results + +| Check | Status | Details | +|-------|--------|---------| +| Lint | **PASS** | All checks passed with ruff | +| Format | **PASS** | 2 files already formatted correctly | +| Tests | **PASS** | 14/14 tests passing | +| Import | **PASS** | All modules import successfully | + +--- + +## Pattern Compliance + +- [x] Follows existing code structure +- [x] Type safety maintained (proper type hints throughout) +- [x] Naming conventions followed +- [x] Tests added for all new functionality +- [x] Backward compatibility preserved +- [x] Logging patterns consistent +- [x] Error handling comprehensive + +--- + +## Security Analysis + +**No security concerns identified.** The implementation properly: +- Uses process termination escalation (terminate → kill) +- Implements runtime limits to prevent resource exhaustion +- Provides admin functions with appropriate scoping +- Handles zombie processes gracefully +- No user input without validation + +--- + +## Thread Safety Review + +**Excellent thread safety implementation:** +- **`cron_service.py:39-68`**: Proper locking in `_terminate_running_job_unlocked()` with separation of locked/unlocked variants +- **`cron_service.py:84-100`**: Timeout monitor correctly acquires lock before shared state access +- **`cron_service.py:164-176`**: Job start tracking properly synchronized +- **`cron_service.py:310-312`**: Scheduler shutdown waits for completion before lock acquisition to prevent deadlock + +--- + +## What's Good + +- **Robust Error Handling**: Nested exception handling gracefully manages timeout scenarios and zombie processes (lines 50-58) +- **Production-Grade Features**: Runtime limits, monitoring, and administrative controls make this production-ready +- **Excellent Test Coverage**: 14 comprehensive unit tests cover edge cases, failure scenarios, and admin functions +- **Thread Safety**: Proper use of locks with careful consideration of deadlock prevention +- **Backward Compatibility**: All existing functionality preserved, new features are additive +- **Code Quality**: Clean separation of concerns, consistent logging, proper resource cleanup +- **Documentation**: Clear implementation report showing planned vs. actual work + +--- + +## Recommendation + +**APPROVE** ✅ + +This PR successfully addresses all production reliability concerns identified in #297. The implementation is production-grade with: + +- ✅ **Thread safety** - Proper locking prevents race conditions +- ✅ **Error handling** - Robust timeout and zombie process management +- ✅ **Runtime monitoring** - Configurable limits with proactive termination +- ✅ **Admin capabilities** - Manual termination and long-running job identification +- ✅ **Test coverage** - Comprehensive tests for all new functionality +- ✅ **Code quality** - Follows project patterns and maintains backward compatibility + +**Ready for merge.** The suggestions are non-blocking improvements for future iterations. + +--- + +*Reviewed by Claude* +*Report: `.claude/PRPs/reviews/pr-305-review.md`* \ No newline at end of file diff --git a/packages/frontend/src/hooks/useApi.ts b/packages/frontend/src/hooks/useApi.ts index 1038ce4..ad754ac 100644 --- a/packages/frontend/src/hooks/useApi.ts +++ b/packages/frontend/src/hooks/useApi.ts @@ -478,6 +478,10 @@ export const api = { request<{ workflows: WorkflowDefinition[] }>("/workflows"), getWorkspaceWorkflows: () => request<{ workflows: WorkspaceWorkflowSummary[] }>("/workspace/workflows"), + terminateWorkflow: (workflowId: string) => + request<{ success: boolean; message?: string }>(`/workflows/${encodeURIComponent(workflowId)}/terminate`, { + method: "POST", + }), saveWorkflows: (workflows: WorkflowDefinition[]) => request<{ workflows: WorkflowDefinition[] }>("/workflows", { method: "PUT", diff --git a/packages/frontend/src/pages/TasksPage.tsx b/packages/frontend/src/pages/TasksPage.tsx index aac18b4..5e6448e 100644 --- a/packages/frontend/src/pages/TasksPage.tsx +++ b/packages/frontend/src/pages/TasksPage.tsx @@ -110,6 +110,9 @@ export const TasksPage: React.FC = () => { WorkspaceWorkflowSummary[] >([]); const [createOpen, setCreateOpen] = useState(false); + const [terminatingWorkflow, setTerminatingWorkflow] = useState(null); + const [terminateModal, setTerminateModal] = useState(false); + const [selectedWorkflow, setSelectedWorkflow] = useState(null); const [newName, setNewName] = useState(""); const navigate = useNavigate(); const isTemplate = (task: ArtefactSummary) => { @@ -160,6 +163,32 @@ export const TasksPage: React.FC = () => { navigate(`/tasks/${filename}`); }; + const handleTerminate = (workflow: WorkspaceWorkflowSummary) => { + setSelectedWorkflow(workflow); + setTerminateModal(true); + }; + + const handleConfirmTerminate = async () => { + if (!selectedWorkflow) return; + + const workflowId = `${selectedWorkflow.repository}:${selectedWorkflow.id}`; + setTerminatingWorkflow(workflowId); + setTerminateModal(false); + + try { + await api.terminateWorkflow(workflowId); + // Refresh workflow list + const res = await api.getWorkspaceWorkflows(); + setWorkspaceWorkflows(res.workflows); + } catch (error) { + console.error("Failed to terminate workflow:", error); + alert(`Failed to terminate workflow: ${error instanceof Error ? error.message : "Unknown error"}`); + } finally { + setTerminatingWorkflow(null); + setSelectedWorkflow(null); + } + }; + return (

Tasks

@@ -186,6 +215,7 @@ export const TasksPage: React.FC = () => { Repository Last run Diagnostics + Actions @@ -218,6 +248,18 @@ export const TasksPage: React.FC = () => { workflow.diagnostics, )} + + {workflow.diagnostics?.running && ( + + )} + ); })} @@ -324,6 +366,25 @@ export const TasksPage: React.FC = () => {
+ + setTerminateModal(false)} + > +

Are you sure you want to terminate this job?

+ {selectedWorkflow && ( +

Workflow: {selectedWorkflow.name}

+ )} +
+ + +
+
); }; diff --git a/packages/pybackend/app.py b/packages/pybackend/app.py index 1699a03..84ecfa8 100644 --- a/packages/pybackend/app.py +++ b/packages/pybackend/app.py @@ -52,6 +52,7 @@ from harness_service import is_process_running, list_harnesses, run_harness from workflow_service import list_workspace_workflows, read_workflows, write_workflows from cron_service import ( + force_terminate_job, get_cron_job_diagnostics, get_cron_job_last_runs, refresh_cron_clock, @@ -99,6 +100,7 @@ ) logger = logging.getLogger("made.pybackend") + @contextlib.asynccontextmanager async def lifespan(_: FastAPI): start_cron_clock() @@ -542,8 +544,6 @@ def repository_git_worktree_delete(name: str): raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail=str(exc)) - - @app.get("/api/repositories/{name}/workflows") def repository_workflows(name: str): try: @@ -612,6 +612,26 @@ def workspace_workflows(): status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=str(exc) ) + +@app.post("/api/workflows/{workflow_id}/terminate") +def terminate_workflow(workflow_id: str): + """Terminate a running workflow job.""" + try: + success = force_terminate_job(workflow_id) + if not success: + raise HTTPException( + status_code=status.HTTP_404_NOT_FOUND, + detail="No active workflow process to terminate", + ) + return {"success": True, "message": "Workflow terminated successfully"} + except Exception as e: + logger.exception(f"Error terminating workflow {workflow_id}") + raise HTTPException( + status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, + detail=f"Failed to terminate workflow: {str(e)}", + ) + + @app.post("/api/cron/update") def update_cron_jobs(): try: @@ -623,6 +643,7 @@ def update_cron_jobs(): status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=str(exc) ) + @app.get("/api/commands") def global_commands(): try: