Commit b441eec

iris: live resource monitoring in dashboard and heartbeat (#3085)
## Summary

- Add host-level resource collection (CPU%, memory, disk) to worker heartbeats via new `HostMetricsCollector` using `/proc/stat`, `/proc/meminfo`, and `shutil.disk_usage`
- Store resource snapshots in controller state (60-entry ring buffer per worker) and expose via `GetWorkerStatus` RPC
- Overhaul dashboard frontend with CSS design tokens, shared gauge components (`MetricCard`, `Gauge`, `InlineGauge`, `ResourceSection`), and live resource visualizations across all dashboard pages
- Update `AGENTS.md` with comprehensive dashboard design system documentation

Addresses #3069

## Changes

**Backend (proto + Python):**
- `WorkerResourceSnapshot` proto message with CPU%, memory used/total, disk used/total, running task count
- `HostMetricsCollector` class in `env_probe.py` — delta-based CPU measurement, graceful fallback on non-Linux
- Heartbeat response includes resource snapshot; controller stores history ring buffer
- `GetWorkerStatus` RPC returns `current_resources` and `resource_history`

**Frontend (JS + CSS):**
- Design tokens in `:root` CSS variables (colors, typography, radius, shadows)
- Shared components: `MetricCard`, `Gauge`, `InlineGauge`, `ResourceSection`
- Worker detail: live CPU/memory/disk gauge bars from heartbeat data, metric cards
- Task detail: card layout with resource gauges, auto-refresh for active tasks
- Worker dashboard: aggregate resource summary, auto-refresh
- Controller: cluster summary bar, auto-refresh
- Fleet tab: styled CPU/memory/task columns

**Tests:**
- New E2E test `test_worker_detail_metric_cards`
- All 700 existing tests pass (1 pre-existing CoreWeave live infra failure, unrelated)

## Test plan

- [x] `./infra/pre-commit.py --all-files --fix` passes
- [x] `uv run pytest lib/iris/tests/e2e/test_dashboard.py -x -o "addopts="` — 8/8 pass
- [x] `uv run pytest lib/iris/tests/ -x -o "addopts=" --timeout=120` — 699/700 pass (1 pre-existing CoreWeave env failure)
- [ ] Manual review of dashboard with a live cluster to verify gauge rendering

🤖 Generated with [Claude Code](https://claude.com/claude-code)
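The delta-based CPU measurement mentioned above works by sampling the aggregate `cpu` line of `/proc/stat` twice and computing utilization from the change in idle vs. total jiffies. A minimal sketch of that idea (the function names and two-sample API are illustrative, not the PR's actual `HostMetricsCollector` interface):

```python
# Illustrative sketch of delta-based CPU% from /proc/stat samples.
# Field layout per proc(5): user nice system idle iowait irq softirq ...
def read_cpu_times(stat_text: str) -> tuple[int, int]:
    """Parse the aggregate 'cpu' line of /proc/stat into (idle, total) jiffies."""
    fields = stat_text.splitlines()[0].split()[1:]
    values = [int(v) for v in fields]
    idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
    return idle, sum(values)


def cpu_percent(prev: tuple[int, int], curr: tuple[int, int]) -> float:
    """CPU utilization between two samples; 0.0 when no time has elapsed."""
    idle_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1.0 - idle_delta / total_delta)
```

On non-Linux hosts `/proc/stat` does not exist, so a collector along these lines would catch the `OSError` and simply omit CPU data, which is presumably what the "graceful fallback on non-Linux" refers to.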
1 parent ed85f54 commit b441eec

File tree

27 files changed

+1891
-411
lines changed


lib/iris/AGENTS.md

Lines changed: 75 additions & 9 deletions
@@ -101,6 +101,7 @@ When adding new modules or significant features:
 | Autoscaler Design | docs/autoscaler-v0-design.md | Technical specification, threading model |
 | Thread Safety | docs/thread-safety.md | Thread management, test synchronization best practices |
 | Original Design | docs/fray-zero.md | Rationale and design decisions |
+| Task States | docs/task-states.md | Task state machine, transitions, retry semantics, dashboard display |
 | CoreWeave Integration | (below) | Platform, runtime, and networking for CoreWeave bare metal |

 ### CoreWeave Integration
@@ -285,32 +286,97 @@ The controller and worker dashboards are client-side SPAs using Preact + HTM.
 ```
 src/iris/cluster/static/
 ├── controller/ # Controller dashboard
-│ ├── app.js # Main app (tabs, state, data fetching)
+│ ├── app.js # Main app (tabs, cluster summary, data fetching)
 │ ├── jobs-tab.js # Jobs table with pagination/sorting/tree view
 │ ├── job-detail.js # Job detail page with task list
+│ ├── fleet-tab.js # Fleet tab: worker health table with inline gauges
+│ ├── worker-detail.js # Worker detail page: live resource gauges, task history, logs
 │ ├── workers-tab.js # Workers table
 │ └── vms-tab.js # VM management table
-├── shared/ # Shared utilities
+├── shared/ # Shared utilities and components
+│ ├── components.js # Reusable Preact components (MetricCard, Gauge, etc.)
 │ ├── rpc.js # Connect RPC client wrapper
-│ ├── utils.js # Formatting (dates, durations)
-│ └── styles.css # Consolidated CSS
-├── vendor/ # Third-party ES modules
+│ ├── utils.js # Formatting (dates, durations, bytes)
+│ └── styles.css # Consolidated CSS with design tokens
+├── vendor/ # Third-party ES modules (vendored, not npm)
 │ ├── preact.mjs # UI framework
 │ └── htm.mjs # HTML template literals
 └── worker/ # Worker dashboard components
+    ├── app.js # Worker dashboard: task list, aggregate resources
+    └── task-detail.js # Task detail: resource usage, auto-refresh
 ```

 **Key patterns:**
-- All data fetched via Connect RPC (e.g., `ListJobs`, `GetJobStatus`)
-- No REST endpoints - RPC only
-- State management with Preact hooks (`useState`, `useEffect`)
+- All data fetched via Connect RPC (e.g., `ListJobs`, `GetWorkerStatus`)
+- No REST endpoints -- RPC only. New dashboard features MUST have a backing RPC.
+- State management with Preact hooks (`useState`, `useEffect`, `useCallback`)
 - HTML templates via `htm.bind(h)` tagged template literals
+- Auto-refresh: active pages poll their RPC endpoint (5s for worker/task detail, 30s for controller overview)
 - Jobs displayed as a hierarchical tree based on name structure

+#### Design System
+
+The dashboard uses CSS custom properties (design tokens) defined in `:root` at the
+top of `shared/styles.css`. All new styles should reference these tokens rather
+than hard-coding colors, fonts, or shadows.
+
+| Token family | Examples | Purpose |
+|------------------|---------------------------------------------|---------------------------------|
+| `--color-*` | `--color-accent`, `--color-danger` | Semantic colors |
+| `--font-*` | `--font-sans`, `--font-mono` | Typography stacks |
+| `--radius-*` | `--radius-sm`, `--radius-md` | Border radius scale |
+| `--shadow-*` | `--shadow-sm`, `--shadow-md` | Elevation / depth |
+
+#### Shared Components (`shared/components.js`)
+
+| Component | Purpose |
+|------------------|-------------------------------------------------------|
+| `MetricCard` | Prominent number + label tile (e.g., "3 Running Tasks") |
+| `Gauge` | Horizontal bar gauge with ok/warning/danger thresholds |
+| `InlineGauge` | Compact gauge for use inside table cells |
+| `ResourceSection`| Titled wrapper for a group of Gauge bars |
+| `InfoRow` | Simple label/value pair |
+| `InfoCard` | Card container with title |
+
+When adding new dashboard components, add them to `shared/components.js` if
+they are reusable across pages. Page-specific components (e.g., `StatusBadge`)
+live in the page's own JS file.
+
+#### Resource Display Conventions
+
+- **Live utilization** from heartbeat snapshots (`WorkerResourceSnapshot`) is shown
+  as Gauge bars with ok/warning/danger color thresholds (70%/90% by default).
+- **Static capacity** from worker metadata (CPU cores, total memory) is shown as
+  MetricCard tiles or Field label/value pairs.
+- When live data is available, prefer gauges over raw numbers. Fall back to static
+  capacity display when no heartbeat snapshots exist yet.
+- Use `formatBytes()` for human-readable byte values. Use `formatRelativeTime()`
+  for timestamps.
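The 70%/90% gauge thresholds and the byte formatting above can be sketched as follows. The real implementations are JS in `shared/components.js` and `shared/utils.js`; this is a Python illustration of the logic, and the exact output format of the real `formatBytes()` is an assumption:

```python
# Threshold classification used by the dashboard gauges (70% / 90% defaults).
def gauge_level(percent: float, warning: float = 70.0, danger: float = 90.0) -> str:
    """Map a utilization percentage to the gauge's color class."""
    if percent >= danger:
        return "danger"
    if percent >= warning:
        return "warning"
    return "ok"


def format_bytes(n: int) -> str:
    """Human-readable bytes, akin to the dashboard's formatBytes() helper."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024
```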
+
+#### Navigation Flow
+
+```
+Controller Dashboard (/)
+├── Jobs tab (default) → click job row → /job/{jobId}
+│   └── Job Detail → task list, task logs, resource usage
+├── Fleet tab → click worker row → /worker/{id}
+│   └── Worker Detail → identity, live resources, task history, logs
+├── Endpoints tab
+├── Autoscaler tab
+├── Logs tab
+└── Transactions tab
+
+Worker Dashboard (:worker_port/)
+├── Task list with aggregate resource summary
+└── Click task → /task/{taskId} → Task Detail (auto-refreshing)
+```

 **When modifying the dashboard:**
 1. Run dashboard tests: `uv run pytest lib/iris/tests/e2e/test_dashboard.py -x -o "addopts="`
-2. Ensure any new UI features have corresponding RPC endpoints
+2. New UI features MUST have a corresponding RPC endpoint — no internal API calls
 3. Follow existing component patterns (functional components, hooks)
+4. Preserve CSS class names that E2E tests rely on (`.worker-detail-grid`, `.tab-btn`, `#log-container`, etc.)
+5. Use design tokens from `:root` — do not hard-code colors or fonts

 ## Testing
lib/iris/docs/task-states.md

Lines changed: 276 additions & 0 deletions
@@ -0,0 +1,276 @@
# Task States Reference

This document describes the `TaskState` enum, the state machine governing task
lifecycle, retry semantics, and how states appear in the dashboard.

A **task** is the unit of execution in Iris. Each job expands into one or more
tasks (controlled by `replicas`). Tasks are independently scheduled, retried,
and tracked. Job state is derived from task state counts -- there is no
independent job state machine.

## State Diagram

```
        +-----------+
        |  PENDING  |<--------------------------+
        +-----+-----+                           |
              |                                 |
      dispatch to worker                        |
              |                                 |
              v                                 |
        +-----------+                           |
        | ASSIGNED  |                           |
        +-----+-----+                           |
              |                                 |
      worker starts task                        |
              |                                 |
              v                                 |
        +-----------+                           |
        | BUILDING  |                           |
        +-----+-----+                           |
              |                                 |
       build completes                          |
              |                                 |
              v                                 |
        +-----------+                           |
        |  RUNNING  |                           |
        +-----+-----+                           |
              |                                 |
      +-------+---------------+                 |
      |                       |                 |
      v                       v                 |
+-----------+           +-----------+     +------------+
| SUCCEEDED |           |  FAILED   |---->|   retry    |
+-----------+           +-----------+     +------------+
                              |                 ^
                              | exhausted       |
                              v                 |
                          (terminal)            |
                                                |
                        +-----------+           |
                        |WORKER_FAIL|-----------+
                        +-----------+
                              |
                              | exhausted
                              v
                          (terminal)

Other terminal states: KILLED, UNSCHEDULABLE (never retried)
```

## State Table

| State | Proto Value | Terminal | Retriable | Set By | Dashboard Display |
|---|---|---|---|---|---|
| `UNSPECIFIED` | 0 | -- | -- | Default zero value; never used in practice | `unspecified` (grey) |
| `PENDING` | 1 | No | -- | Job submission (`_on_job_submitted`), retry requeue (`_requeue_task`) | `pending` (amber) |
| `ASSIGNED` | 9 | No | -- | Scheduler dispatch (`_on_task_assigned` / `create_attempt`) | `assigned` (orange) |
| `BUILDING` | 2 | No | -- | Worker heartbeat report; worker sets this during bundle download and dependency sync | `building` (purple) |
| `RUNNING` | 3 | No | -- | Worker heartbeat report; worker sets this when the user command starts | `running` (blue) |
| `SUCCEEDED` | 4 | Yes | No | Worker heartbeat report; task exited with code 0 | `succeeded` (green) |
| `FAILED` | 5 | Yes | Yes | Worker heartbeat report; task exited with non-zero code | `failed` (red) |
| `KILLED` | 6 | Yes | No | Controller: job cancellation (`_on_job_cancelled`), job failure cascade (`_mark_remaining_tasks_killed`), per-task timeout | `killed` (grey) |
| `WORKER_FAILED` | 7 | Yes | Yes | Controller: worker death cascade (`_on_worker_failed`), coscheduled sibling kill | `worker_failed` (purple) |
| `UNSCHEDULABLE` | 8 | Yes | No | Controller: scheduling timeout expired (`_mark_task_unschedulable`) | `unschedulable` (red) |

## State Transitions in Detail

### PENDING

The initial state for every task. Set in two contexts:

1. **Job submission**: `_on_job_submitted` calls `expand_job_to_tasks`, which
   creates `ControllerTask` objects with `state=TASK_STATE_PENDING`. Tasks are
   enqueued into the priority-sorted scheduling queue.

2. **Retry requeue**: `_requeue_task` resets `task.state` to `TASK_STATE_PENDING`
   and re-inserts the task into the scheduling queue. This happens after a
   retriable `FAILED` or `WORKER_FAILED` when retry budget remains.

### ASSIGNED

Set by `_on_task_assigned` after the scheduler selects a worker and commits
resources. `create_attempt` creates a new `ControllerTaskAttempt` in the
`TASK_STATE_ASSIGNED` state. The task is now bound to a specific worker and
consuming its resources.

The worker has not yet acknowledged the task -- it will receive the dispatch
in the next heartbeat cycle.

### BUILDING

Reported by the worker via heartbeat. The worker transitions internally:

- `PENDING -> BUILDING` when bundle download starts (`task_attempt.py:433`)
- Later, `BUILDING` again when dependency sync starts (`task_attempt.py:549`)

The controller processes this transition in `complete_heartbeat`. Note: if the
worker reports `PENDING`, the controller ignores it to prevent regressing an
`ASSIGNED` task and confusing the building-count backpressure window.

### RUNNING

Reported by the worker via heartbeat after the user command starts executing
(`task_attempt.py:570`). The controller records `started_at` on the attempt.

### SUCCEEDED

Reported by the worker via heartbeat when the task process exits with code 0.
The controller sets `exit_code=0`, `finished_at`, and marks the task terminal.
No retry logic applies.

### FAILED

Reported by the worker via heartbeat when the task process exits with a non-zero
code. Triggers retry evaluation:

1. `handle_attempt_result` calls `_handle_failure`, which increments
   `failure_count` and compares against `max_retries_failure`.
2. If `failure_count <= max_retries_failure`: returns `SHOULD_RETRY`. The caller
   (`_on_task_state_changed`) calls `_requeue_task`, which resets state to
   `PENDING` and re-enqueues the task. Resources are released from the current
   worker.
3. If `failure_count > max_retries_failure`: returns `EXCEEDED_RETRY_LIMIT`. The
   task remains in `FAILED` state and is terminal. `error` and `exit_code` are
   recorded.

### KILLED

Set by the controller in three scenarios:

1. **User cancellation**: `_on_job_cancelled` iterates non-terminal tasks and
   transitions each to `KILLED`. Tasks with workers assigned are queued for
   kill RPCs.

2. **Job failure cascade**: When a job exceeds `max_task_failures`,
   `_finalize_job_state` calls `_mark_remaining_tasks_killed` to terminate all
   surviving tasks.

3. **Parent job termination**: `_cancel_child_jobs` recursively cancels child
   jobs when a parent reaches a terminal state (except `SUCCEEDED`).

`KILLED` is always terminal and never retried.

### WORKER_FAILED

Set by the controller when a worker dies. `_on_worker_failed` iterates all
tasks on the dead worker and emits `TaskStateChangedEvent` with
`TASK_STATE_WORKER_FAILED` for each non-terminal task.

Retry evaluation uses the preemption budget:

1. `_handle_failure` increments `preemption_count` and compares against
   `max_retries_preemption` (default: 100).
2. If budget remains: `SHOULD_RETRY` -- task is requeued to `PENDING`.
3. If exhausted: `EXCEEDED_RETRY_LIMIT` -- task stays in `WORKER_FAILED`
   and is terminal.

**Coscheduled jobs**: When a task in a coscheduled (gang-scheduled) job fails
terminally, `_cascade_coscheduled_failure` exhausts the preemption budget of
all running siblings and transitions them to `WORKER_FAILED` (terminal). This
prevents other hosts from hanging on collective operations.

### UNSCHEDULABLE

Set by the controller's scheduling loop when a task's scheduling deadline
expires (`_mark_task_unschedulable` in `controller.py`). The deadline is
derived from the job's `scheduling_timeout` field.

`UNSCHEDULABLE` is always terminal. If any task becomes unschedulable, the
entire job transitions to `JOB_STATE_UNSCHEDULABLE` and all remaining tasks
are killed.

## Retry Semantics

Iris maintains two independent retry budgets per task:

| Budget | Counter | Limit Field | Default | Trigger State |
|---|---|---|---|---|
| Failure | `failure_count` | `max_retries_failure` | 0 (no retries) | `FAILED` |
| Preemption | `preemption_count` | `max_retries_preemption` | 100 | `WORKER_FAILED` |

### Retry flow

1. Worker reports terminal state via heartbeat.
2. `handle_attempt_result` delegates to `_handle_failure`.
3. The appropriate counter is incremented.
4. If `counter <= limit`: `TaskTransitionResult.SHOULD_RETRY`.
   - `_on_task_state_changed` calls `_requeue_task`.
   - Task state is reset to `PENDING`. A new attempt will be created when the
     scheduler re-dispatches.
   - Worker resources are released via `_cleanup_task_resources`.
5. If `counter > limit`: `TaskTransitionResult.EXCEEDED_RETRY_LIMIT`.
   - Task remains in its failure state and is terminal.
   - `is_finished()` returns `True`.
   - The job's `_compute_job_state` may trigger a job-level state change
     (e.g., `JOB_STATE_FAILED` if `max_task_failures` is exceeded).
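The dual-budget decision above can be condensed into a small sketch. The counter and limit names mirror the document; the dataclass shape and function signature are illustrative, not the real `_handle_failure`:

```python
# Sketch of the dual retry budgets: FAILED draws on the failure budget,
# WORKER_FAILED on the preemption budget.
from dataclasses import dataclass
from enum import Enum


class TaskTransitionResult(Enum):
    SHOULD_RETRY = "should_retry"
    EXCEEDED_RETRY_LIMIT = "exceeded_retry_limit"


@dataclass
class Task:
    failure_count: int = 0
    preemption_count: int = 0
    max_retries_failure: int = 0       # default: no retries on FAILED
    max_retries_preemption: int = 100  # default preemption budget


def handle_failure(task: Task, state: str) -> TaskTransitionResult:
    """Increment the budget matching the failure kind, then compare to its limit."""
    if state == "WORKER_FAILED":
        task.preemption_count += 1
        within = task.preemption_count <= task.max_retries_preemption
    else:  # FAILED
        task.failure_count += 1
        within = task.failure_count <= task.max_retries_failure
    if within:
        return TaskTransitionResult.SHOULD_RETRY
    return TaskTransitionResult.EXCEEDED_RETRY_LIMIT
```

With the defaults, a first `FAILED` already exhausts the budget (limit 0), while a first `WORKER_FAILED` retries, which matches the table above.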

### What counts toward job failure

Only `TASK_STATE_FAILED` counts toward the job's `max_task_failures` threshold.
Worker failures (preemptions) do not count. This means a job can survive
unlimited preemptions as long as the per-task preemption budget is not
exhausted.

### States that are never retried

- `SUCCEEDED`: task completed successfully
- `KILLED`: explicit termination by user or cascade
- `UNSCHEDULABLE`: scheduling timeout expired

## Terminal State Summary

A task is considered finished (`is_finished() == True`) when:

| State | Condition |
|---|---|
| `SUCCEEDED` | Always finished |
| `KILLED` | Always finished |
| `UNSCHEDULABLE` | Always finished |
| `FAILED` | Finished when `failure_count > max_retries_failure` |
| `WORKER_FAILED` | Finished when `preemption_count > max_retries_preemption` |

The distinction matters: a task in `FAILED` state with retry budget remaining
is in a terminal state at the attempt level but is not finished at the task
level. `can_be_scheduled()` returns `True` for such tasks.
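The table above reads directly as a predicate. A small sketch, using illustrative keyword arguments rather than the real class's fields:

```python
# The finished/terminal distinction from the summary table, as a predicate.
def is_finished(state: str, failure_count: int = 0, max_retries_failure: int = 0,
                preemption_count: int = 0, max_retries_preemption: int = 100) -> bool:
    if state in ("SUCCEEDED", "KILLED", "UNSCHEDULABLE"):
        return True  # unconditionally finished
    if state == "FAILED":
        return failure_count > max_retries_failure
    if state == "WORKER_FAILED":
        return preemption_count > max_retries_preemption
    return False  # PENDING/ASSIGNED/BUILDING/RUNNING are never finished
```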

## Dashboard Display

The dashboard uses `stateToName()` from `shared/utils.js` to convert proto enum
strings (e.g., `TASK_STATE_RUNNING`) to lowercase display names by stripping the
`TASK_STATE_` prefix. Each name maps to a CSS class `status-{name}`:

| Display Name | CSS Class | Color |
|---|---|---|
| `pending` | `.status-pending` | Amber (#9a6700) |
| `assigned` | `.status-assigned` | Orange (#bc4c00) |
| `building` | `.status-building` | Purple (#8250df) |
| `running` | `.status-running` | Blue (#0969da) |
| `succeeded` | `.status-succeeded` | Green (#1a7f37) |
| `failed` | `.status-failed` | Red (#cf222e) |
| `killed` | `.status-killed` | Grey (#57606a) |
| `worker_failed` | `.status-worker_failed` | Purple (#8250df) |
| `unschedulable` | `.status-unschedulable` | Red (#cf222e) |

The job detail page shows per-task attempt history. Each attempt has its own
state badge, and worker failures are annotated with "(worker failure)" in the
attempt rows.

Pending tasks display a `pending_reason` diagnostic below the state badge when
the controller can identify why the task cannot be scheduled (e.g., no workers
match constraints).

## Job State Derivation

Job state is computed from task state counts in `_compute_job_state()`:

1. **SUCCEEDED**: All tasks are in `TASK_STATE_SUCCEEDED`.
2. **FAILED**: Count of `TASK_STATE_FAILED` tasks exceeds `max_task_failures`.
3. **UNSCHEDULABLE**: Any task is `TASK_STATE_UNSCHEDULABLE`.
4. **KILLED**: Any task is `TASK_STATE_KILLED` (and the job is not already terminal).
5. **RUNNING**: Any task is `ASSIGNED`, `BUILDING`, or `RUNNING`.
6. **PENDING**: Default (no tasks have started).

The ordering matters -- earlier rules take priority. A job with one succeeded
task and one failed task (beyond tolerance) is `FAILED`, not `RUNNING`.
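The priority-ordered rules above can be sketched over a plain list of task state names (a hypothetical shape; the real `_compute_job_state` works on controller task objects):

```python
# Sketch of the priority-ordered job state derivation: earlier rules win.
from collections import Counter


def compute_job_state(task_states: list[str], max_task_failures: int = 0) -> str:
    counts = Counter(task_states)
    if task_states and counts["SUCCEEDED"] == len(task_states):
        return "SUCCEEDED"                       # rule 1: all tasks succeeded
    if counts["FAILED"] > max_task_failures:
        return "FAILED"                          # rule 2: failures exceed tolerance
    if counts["UNSCHEDULABLE"]:
        return "UNSCHEDULABLE"                   # rule 3
    if counts["KILLED"]:
        return "KILLED"                          # rule 4
    if counts["ASSIGNED"] or counts["BUILDING"] or counts["RUNNING"]:
        return "RUNNING"                         # rule 5
    return "PENDING"                             # rule 6: default
```

The closing example checks out: `["SUCCEEDED", "FAILED"]` with the default tolerance of 0 hits rule 2 and yields `FAILED`, not `RUNNING`.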
