Commit 9c372ee
feat(iris): add request-level observability to controller RPCs (#3073)
Fixes #3071. Related: #3062 (incident), #3067 (retry fix), #2809
(threading survey), #2964 (lock contention).
## Problem
When the GPU canary ferry hit `DEADLINE_EXCEEDED` (#3062), the
controller was alive but we had **no way to tell why `GetTaskLogs` took
30s+**. The retry fix (#3067) treats the symptom. We need observability
to find the root cause next time.
Today: zero RPC durations, zero storage latency measurements, zero
heartbeat round timings, no periodic health snapshot.
## Changes
Five logging hooks — one new file (`interceptors.py`), edits to three
existing files. Logging only, no metrics infra. All thresholds are
hardcoded initial guesses.
| # | Change | Key output |
|---|--------|------------|
| 1 | RPC timing interceptor (`UnaryInterceptorSync`) | `WARN RPC GetTaskLogs completed in 31204ms (slow)` |
| 2 | Storage read timing in `get_task_logs` | `WARN Storage read for job/0 attempt 0: 28500ms (slow)` |
| 3 | Heartbeat + scheduling phase-level timing | `DEBUG Heartbeat round: 128 workers, 3 failed, 4200ms (snapshot: 3100ms)` |
| 4 | Periodic health summary (~30s) | `INFO Controller status: 128 workers (2 failed), 2 active jobs, 0 pending` |
| 5 | Uvicorn log level `"error"` → `"warning"` | Surfaces connection warnings from GH Actions |
The snapshot sub-timing in Change 3 disambiguates lock contention from
slow RPCs: if `snapshot_ms` dominates the round time, the
`ControllerState` RLock is contended. If `elapsed >> snapshot_ms`, slow
worker RPCs are the cause.
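A minimal sketch of that two-timer pattern, to make the reading rule concrete. It is not the controller's actual code: `timed_round`, `classify_round`, the callables passed in, and the 50% cutoff for "dominates" are all assumptions made for illustration. Only the 5s heartbeat threshold comes from this PR. The sketch also assumes worker RPCs run after the lock is released.

```python
import logging
import threading
import time

logger = logging.getLogger("heartbeat_sketch")

HEARTBEAT_SLOW_MS = 5000   # heartbeat threshold listed in this PR
SNAPSHOT_DOMINANCE = 0.5   # assumption: "dominates" means over half the round


def classify_round(elapsed_ms, snapshot_ms):
    """Name the likely cause of a heartbeat round, given its two timings."""
    if elapsed_ms <= HEARTBEAT_SLOW_MS:
        return "ok"
    if snapshot_ms / elapsed_ms > SNAPSHOT_DOMINANCE:
        return "lock_contention"
    return "slow_worker_rpcs"


def timed_round(state_lock, take_snapshot, send_heartbeats):
    """Run one round, timing the lock-holding phase separately from the total."""
    start = time.monotonic()
    with state_lock:  # time spent waiting for the lock counts toward snapshot_ms
        snapshot = take_snapshot()
    snapshot_ms = (time.monotonic() - start) * 1000
    send_heartbeats(snapshot)  # assumption: worker RPCs run outside the lock
    elapsed_ms = (time.monotonic() - start) * 1000
    level = logging.WARNING if elapsed_ms > HEARTBEAT_SLOW_MS else logging.DEBUG
    logger.log(level, "Heartbeat round: %d workers, %dms (snapshot: %dms)",
               len(snapshot), elapsed_ms, snapshot_ms)
    return elapsed_ms, snapshot_ms
```

Because the timer starts before the lock is acquired, time spent waiting on the lock is folded into `snapshot_ms`. That is what lets an outside measurement stand in for instrumenting the lock itself.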
**Diagnostic trail — slow storage:**
```
INFO Controller status: 128 workers (0 failed), 2 active jobs, 0 pending tasks
DEBUG Heartbeat round: 128 workers, 3 failed, 4200ms (snapshot: 45ms)
WARN Storage read for job/0/0 attempt 0: 28500ms (slow)
WARN RPC GetTaskLogs completed in 31204ms (slow)
```
**Diagnostic trail — lock contention:**
```
WARN Heartbeat round: 128 workers, 0 failed, 6200ms (snapshot: 5800ms)
WARN RPC GetTaskLogs completed in 31204ms (slow)
```
<details><summary>Scope boundaries</summary>
- **Logging only** — no Prometheus/OTel. A metrics stack is a separate
effort (#2826).
- **Hardcoded thresholds** (RPC: 1s, storage: 2s, heartbeat: 5s) —
tunable later once we have baseline data.
- **Phase-level timing as lock-contention proxy** — we time the
lock-acquiring phases from outside rather than instrumenting
`ControllerState`'s RLock (too invasive, too noisy).
- **`DEBUG` for normal, `WARNING` for slow/errored** — no `INFO`-level
RPC logging (too noisy at 128 workers × 1s polling).
- **Not in scope**: performance fixes, K8s CPU allocation (per
@rjpower's comment on #3071), structured logging format.
</details>
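The threshold and log-level rules above can be sketched as a small context manager. This is an illustration only: `timed`, `THRESHOLDS_MS`, and the injectable `clock` parameter are hypothetical names rather than anything in `interceptors.py` or `get_task_logs`. The threshold values are the ones listed in the scope boundaries.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("timing_sketch")

# Hardcoded initial guesses from this PR, in milliseconds.
THRESHOLDS_MS = {"rpc": 1000, "storage": 2000, "heartbeat": 5000}


@contextmanager
def timed(label, threshold_ms, clock=time.monotonic):
    """Log the wrapped block at DEBUG normally, at WARNING when it is slow."""
    start = clock()
    try:
        yield
    finally:
        # Runs on success and on exception, so failed reads are timed too.
        elapsed_ms = (clock() - start) * 1000
        if elapsed_ms > threshold_ms:
            logger.warning("%s: %dms (slow)", label, elapsed_ms)
        else:
            logger.debug("%s: %dms", label, elapsed_ms)
```

A storage read would then be wrapped as `with timed("Storage read for job/0 attempt 0", THRESHOLDS_MS["storage"]): ...`. Passing `clock` as a parameter is a testing convenience so the thresholds can be exercised without sleeping.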
## Test plan
- [x] Unit tests for `RequestTimingInterceptor` (pass-through + re-raise
contract)
- [x] Existing controller tests pass (268 tests)
- [ ] Post-merge: verify `DEBUG RPC ...` on dashboard navigation, `DEBUG
Heartbeat round:` every ~5s, `INFO Controller status:` every ~30s, no
`WARNING` under normal load
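
The pass-through and re-raise contract in the first test-plan item can be illustrated with a framework-free wrapper. This is a sketch under stated assumptions: `TimingWrapper` and its `call` signature are hypothetical and do not reproduce the real `RequestTimingInterceptor` or the `UnaryInterceptorSync` protocol, whose method signature may differ. The 1s threshold is the RPC value from this PR.

```python
import logging
import time

logger = logging.getLogger("rpc_timing_sketch")

RPC_SLOW_MS = 1000  # RPC threshold listed in this PR


class TimingWrapper:
    """Times a handler call without altering its result or its exception."""

    def __init__(self, slow_ms=RPC_SLOW_MS, clock=time.monotonic):
        self._slow_ms = slow_ms
        self._clock = clock

    def call(self, method_name, handler, request):
        start = self._clock()
        try:
            response = handler(request)
        except Exception:
            elapsed_ms = (self._clock() - start) * 1000
            logger.warning("RPC %s failed after %dms", method_name, elapsed_ms)
            raise  # re-raise: the caller sees the original exception
        elapsed_ms = (self._clock() - start) * 1000
        if elapsed_ms > self._slow_ms:
            logger.warning("RPC %s completed in %dms (slow)", method_name, elapsed_ms)
        else:
            logger.debug("RPC %s completed in %dms", method_name, elapsed_ms)
        return response  # pass-through: the response is returned unchanged
```

A unit test for the contract then needs only two checks: the wrapper returns exactly what the handler returned, and an exception raised by the handler reaches the caller with its type intact.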

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Parent: 82d1a79
5 files changed (+128, -5). Directories listed in the file tree: `lib/iris`, `src/iris`, `cluster/controller`, `rpc`, `tests/rpc`.