
Commit 9c372ee

yonromai and claude authored
feat(iris): add request-level observability to controller RPCs (#3073)
Fixes #3071. Related: #3062 (incident), #3067 (retry fix), #2809 (threading survey), #2964 (lock contention).

## Problem

When the GPU canary ferry hit `DEADLINE_EXCEEDED` (#3062), the controller was alive but we had **no way to tell why `GetTaskLogs` took 30s+**. The retry fix (#3067) treats the symptom. We need observability to find the root cause next time. Today: zero RPC durations, zero storage latency measurements, zero heartbeat round timings, no periodic health snapshot.

## Changes

Five logging hooks — one new file (`interceptors.py`), edits to three existing files. Logging only, no metrics infra. All thresholds are hardcoded initial guesses.

| # | Change | Key output |
|---|--------|------------|
| 1 | RPC timing interceptor (`UnaryInterceptorSync`) | `WARN RPC GetTaskLogs completed in 31204ms (slow)` |
| 2 | Storage read timing in `get_task_logs` | `WARN Storage read for job/0 attempt 0: 28500ms (slow)` |
| 3 | Heartbeat + scheduling phase-level timing | `DEBUG Heartbeat round: 128 workers, 3 failed, 4200ms (snapshot: 3100ms)` |
| 4 | Periodic health summary (~30s) | `INFO Controller status: 128 workers (2 failed), 2 active jobs, 0 pending` |
| 5 | Uvicorn log level `"error"` → `"warning"` | Surfaces connection warnings from GH Actions |

The snapshot sub-timing in Change 3 disambiguates lock contention from slow RPCs: if `snapshot_ms` dominates the round time, the `ControllerState` RLock is contended. If `elapsed >> snapshot_ms`, slow worker RPCs are the cause.
**Diagnostic trail — slow storage:**

```
INFO Controller status: 128 workers (0 failed), 2 active jobs, 0 pending tasks
DEBUG Heartbeat round: 128 workers, 3 failed, 4200ms (snapshot: 45ms)
WARN Storage read for job/0/0 attempt 0: 28500ms (slow)
WARN RPC GetTaskLogs completed in 31204ms (slow)
```

**Diagnostic trail — lock contention:**

```
WARN Heartbeat round: 128 workers, 0 failed, 6200ms (snapshot: 5800ms)
WARN RPC GetTaskLogs completed in 31204ms (slow)
```

<details><summary>Scope boundaries</summary>

- **Logging only** — no Prometheus/OTel. A metrics stack is a separate effort (#2826).
- **Hardcoded thresholds** (RPC: 1s, storage: 2s, heartbeat: 5s) — tunable later once we have baseline data.
- **Phase-level timing as lock-contention proxy** — we time the lock-acquiring phases from outside rather than instrumenting `ControllerState`'s RLock (too invasive, too noisy).
- **`DEBUG` for normal, `WARNING` for slow/errored** — no `INFO`-level RPC logging (too noisy at 128 workers × 1s polling).
- **Not in scope**: performance fixes, K8s CPU allocation (per @rjpower in #3071), structured logging format.

</details>

## Test plan

- [x] Unit tests for `RequestTimingInterceptor` (pass-through + re-raise contract)
- [x] Existing controller tests pass (268 tests)
- [ ] Post-merge: verify `DEBUG RPC ...` on dashboard navigation, `DEBUG Heartbeat round:` every ~5s, `INFO Controller status:` every ~30s, no `WARNING` under normal load

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
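The triage rule behind these two trails can be written down directly. The helper below is illustrative only — it is not part of this diff; the function name and the 0.8 ratio are made up, while the 5000ms cutoff matches `_SLOW_HEARTBEAT_MS` from the diff:

```python
# Illustrative only (not part of this commit): encodes the triage rule for a
# slow heartbeat round. The 0.8 ratio is an arbitrary choice for the sketch.
def classify_heartbeat_round(elapsed_ms: int, snapshot_ms: int) -> str:
    """Guess the bottleneck from a round's total time and its snapshot sub-timing."""
    if elapsed_ms <= 5000:  # _SLOW_HEARTBEAT_MS in the diff
        return "healthy"
    if snapshot_ms >= 0.8 * elapsed_ms:
        # Phase 1 (lock-acquiring snapshot creation) dominates the round:
        # points at ControllerState RLock contention.
        return "lock contention"
    # Otherwise the time went to the dispatch/consume phases: slow worker RPCs.
    return "slow worker RPCs"


# The two diagnostic trails above:
print(classify_heartbeat_round(6200, 5800))  # lock contention
print(classify_heartbeat_round(4200, 45))    # healthy round; storage is the suspect
```
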
1 parent 82d1a79 commit 9c372ee

File tree

5 files changed (+128, −5 lines)


lib/iris/src/iris/cluster/controller/controller.py

Lines changed: 48 additions & 3 deletions
```diff
@@ -44,10 +44,13 @@
 from iris.managed_thread import ManagedThread, ThreadContainer, get_thread_container
 from iris.rpc import cluster_pb2
 from iris.rpc.cluster_connect import WorkerServiceClientSync
-from iris.time_utils import Duration, ExponentialBackoff
+from iris.time_utils import Duration, ExponentialBackoff, Timer
 
 logger = logging.getLogger(__name__)
 
+_SLOW_HEARTBEAT_MS = 5000
+_HEALTH_SUMMARY_INTERVAL = 6  # every ~30s at 5s heartbeat interval
+
 
 def job_requirements_from_job(job: ControllerJob) -> JobRequirements:
     """Convert a ControllerJob to scheduler-compatible JobRequirements."""
@@ -274,6 +277,8 @@ def __init__(
         # Autoscaler (passed in, configured in start() if provided)
         self._autoscaler: Autoscaler | None = autoscaler
 
+        self._heartbeat_iteration = 0
+
     def wake(self) -> None:
         """Signal the controller loop to run immediately.
@@ -298,7 +303,7 @@ def start(self) -> None:
             self._dashboard._app,
             host=self._config.host,
             port=self._config.port,
-            log_level="error",
+            log_level="warning",
             timeout_keep_alive=120,
         )
         self._server = uvicorn.Server(server_config)
@@ -381,8 +386,10 @@ def _run_scheduling(self) -> None:
         No lock is needed since only one scheduling thread exists. All state
         reads and writes go through ControllerState which has its own lock.
         """
+        timer = Timer()
         pending_tasks = self._state.peek_pending_tasks()
         workers = self._state.get_available_workers()
+        state_read_ms = timer.elapsed_ms()
 
         if not pending_tasks:
             return
@@ -418,6 +425,12 @@ def _run_scheduling(self) -> None:
         # Buffer assignments for heartbeat delivery (commits resources via TaskAssignedEvent)
         if result.assignments:
             self._buffer_assignments(result.assignments)
+        logger.debug(
+            "Scheduling cycle: %d assignments, %dms (state read: %dms)",
+            len(result.assignments),
+            timer.elapsed_ms(),
+            state_read_ms,
+        )
 
     def _buffer_assignments(
         self,
@@ -543,12 +556,17 @@ def _heartbeat_all_workers(self) -> None:
         _on_worker_failed prunes it from state. We detect this (worker no longer
         in state) and evict the cached stub + notify the autoscaler.
         """
-        # Phase 1: create snapshots for all healthy workers
+        round_timer = Timer()
+
+        # Phase 1: create snapshots for all healthy workers (lock-acquiring).
+        # Timing this phase separately gives a lock-contention signal.
+        snapshot_timer = Timer()
         snapshots: list[HeartbeatSnapshot] = []
         for w in self._state.get_available_workers():
             snapshot = self._state.begin_heartbeat(w.worker_id)
             if snapshot:
                 snapshots.append(snapshot)
+        snapshot_ms = snapshot_timer.elapsed_ms()
 
         if not snapshots:
             return
@@ -578,9 +596,11 @@ def _dispatch_worker() -> None:
         worker_futures = [self._dispatch_executor.submit(_dispatch_worker) for _ in range(worker_count)]
 
         # Phase 3: consume all responses; per-worker RPC timeout determines failures.
+        fail_count = 0
         for _ in snapshots:
             snapshot, response, error = result_queue.get()
             if error is not None:
+                fail_count += 1
                 logger.warning("Heartbeat error for %s: %s", snapshot.worker_id, error)
                 self._handle_heartbeat_failure(snapshot, error)
                 continue
@@ -590,6 +610,31 @@ def _dispatch_worker() -> None:
         for future in worker_futures:
             future.cancel()
 
+        elapsed = round_timer.elapsed_ms()
+        level = logging.WARNING if elapsed > _SLOW_HEARTBEAT_MS else logging.DEBUG
+        logger.log(
+            level,
+            "Heartbeat round: %d workers, %d failed, %dms (snapshot: %dms)",
+            len(snapshots),
+            fail_count,
+            elapsed,
+            snapshot_ms,
+        )
+
+        self._heartbeat_iteration += 1
+        if self._heartbeat_iteration % _HEALTH_SUMMARY_INTERVAL == 0:
+            workers = self._state.get_available_workers()
+            jobs = self._state.list_all_jobs()
+            active = sum(1 for j in jobs if j.state == cluster_pb2.JOB_STATE_RUNNING)
+            pending = len(self._state.peek_pending_tasks())
+            logger.info(
+                "Controller status: %d workers (%d failed), %d active jobs, %d pending tasks",
+                len(workers),
+                fail_count,
+                active,
+                pending,
+            )
+
     def _handle_heartbeat_failure(self, snapshot: HeartbeatSnapshot, error: str) -> None:
         """Process a heartbeat failure: update state, evict stub + notify autoscaler if worker died.
```
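All of the hooks above lean on `iris.time_utils.Timer`, whose implementation is not part of this diff. A minimal sketch of the interface the diff assumes — a stopwatch started at construction with an `elapsed_ms()` reader; this is an assumption, not the actual iris class:

```python
import time


class Timer:
    """Sketch of the assumed Timer interface (not the real iris.time_utils.Timer):
    starts on construction, reports elapsed whole milliseconds."""

    def __init__(self) -> None:
        # time.monotonic() is unaffected by wall-clock adjustments, which
        # matters when timing long RPCs and heartbeat rounds.
        self._start = time.monotonic()

    def elapsed_ms(self) -> int:
        return int((time.monotonic() - self._start) * 1000)
```

Under this reading, `snapshot_timer.elapsed_ms()` in Phase 1 and `round_timer.elapsed_ms()` at the end of the round are independent stopwatches, which is what lets the log line report both numbers.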

lib/iris/src/iris/cluster/controller/dashboard.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -27,6 +27,7 @@
 from iris.cluster.dashboard_common import html_shell, static_files_mount
 from iris.rpc import cluster_pb2
 from iris.rpc.cluster_connect import ControllerServiceWSGIApplication
+from iris.rpc.interceptors import RequestTimingInterceptor
 
 logger = logging.getLogger(__name__)
 
@@ -57,7 +58,7 @@ def port(self) -> int:
         return self._port
 
     def _create_app(self) -> Starlette:
-        rpc_wsgi_app = ControllerServiceWSGIApplication(service=self._service)
+        rpc_wsgi_app = ControllerServiceWSGIApplication(service=self._service, interceptors=[RequestTimingInterceptor()])
         rpc_app = WSGIMiddleware(rpc_wsgi_app)
 
         routes = [
```
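How `ControllerServiceWSGIApplication` applies the `interceptors=[...]` list is internal to iris, but interceptor lists with a `(call_next, request, ctx)` signature are conventionally folded around the handler outermost-first. The sketch below illustrates that convention only — `build_chain` and `Tagger` are made-up names, not iris APIs:

```python
from typing import Any, Callable

Handler = Callable[[Any, Any], Any]  # (request, ctx) -> response


def build_chain(handler: Handler, interceptors: list) -> Handler:
    """Wrap handler so interceptors run in list order, outermost first."""
    wrapped = handler
    for interceptor in reversed(interceptors):
        def make(inner: Handler, icpt: Any) -> Handler:
            # Each layer hands the next layer in as call_next.
            return lambda request, ctx: icpt.intercept_unary_sync(inner, request, ctx)
        wrapped = make(wrapped, interceptor)
    return wrapped


class Tagger:
    """Toy interceptor: tags the response so the nesting order is visible."""

    def __init__(self, tag: str) -> None:
        self.tag = tag

    def intercept_unary_sync(self, call_next, request, ctx):
        return f"{self.tag}({call_next(request, ctx)})"


chain = build_chain(lambda req, ctx: "handler", [Tagger("a"), Tagger("b")])
print(chain("req", None))  # a(b(handler))
```

With a single `RequestTimingInterceptor`, this reduces to the timer wrapping the whole handler call, so its duration covers the full request.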

lib/iris/src/iris/cluster/controller/service.py

Lines changed: 11 additions & 1 deletion
```diff
@@ -42,12 +42,13 @@
 from iris.rpc.cluster_connect import WorkerServiceClientSync
 from iris.rpc.errors import rpc_error_handler
 from iris.rpc.proto_utils import task_state_name
-from iris.time_utils import Timestamp
+from iris.time_utils import Timer, Timestamp
 
 logger = logging.getLogger(__name__)
 
 DEFAULT_TRANSACTION_LIMIT = 50
 DEFAULT_MAX_TOTAL_LINES = 10000
+_SLOW_STORAGE_READ_MS = 2000
 
 # Maximum bundle size in bytes (25 MB) - matches client-side limit
 MAX_BUNDLE_SIZE_BYTES = 25 * 1024 * 1024
@@ -919,12 +920,21 @@ def get_task_logs(
                 continue
 
             try:
+                storage_timer = Timer()
                 reader = task_logging.LogReader.from_log_directory(log_directory=attempt.log_directory)
                 log_entries = reader.read_logs(
                     source=None,  # All sources
                     regex_filter=request.regex if request.regex else None,
                     max_lines=max(0, max_lines - total_lines) if max_lines > 0 else 0,
                 )
+                storage_elapsed = storage_timer.elapsed_ms()
+                if storage_elapsed > _SLOW_STORAGE_READ_MS:
+                    logger.warning(
+                        "Storage read for %s attempt %d: %dms (slow)",
+                        task_id_wire,
+                        attempt.attempt_id,
+                        storage_elapsed,
+                    )
 
                 worker_logs = []
                 for entry in log_entries:
```
Lines changed: 29 additions & 0 deletions
```python
# Copyright 2025 The Marin Authors
# SPDX-License-Identifier: Apache-2.0

import logging

from iris.time_utils import Timer

logger = logging.getLogger(__name__)

_SLOW_RPC_THRESHOLD_MS = 1000


class RequestTimingInterceptor:
    """Logs method name + duration for every unary RPC."""

    def intercept_unary_sync(self, call_next, request, ctx):
        method = ctx.method().name
        timer = Timer()
        try:
            response = call_next(request, ctx)
            elapsed = timer.elapsed_ms()
            if elapsed > _SLOW_RPC_THRESHOLD_MS:
                logger.warning("RPC %s completed in %dms (slow)", method, elapsed)
            else:
                logger.debug("RPC %s completed in %dms", method, elapsed)
            return response
        except Exception as e:
            logger.warning("RPC %s failed after %dms: %s", method, timer.elapsed_ms(), e)
            raise
```
Lines changed: 38 additions & 0 deletions
```python
# Copyright 2025 The Marin Authors
# SPDX-License-Identifier: Apache-2.0

from dataclasses import dataclass
from unittest.mock import Mock

import pytest

from iris.rpc.interceptors import RequestTimingInterceptor


@dataclass(frozen=True)
class FakeMethodInfo:
    name: str


def _make_ctx(method_name: str):
    ctx = Mock()
    ctx.method.return_value = FakeMethodInfo(name=method_name)
    return ctx


def test_interceptor_passes_through_response():
    interceptor = RequestTimingInterceptor()
    ctx = _make_ctx("GetTaskLogs")
    result = interceptor.intercept_unary_sync(lambda req, ctx: "ok", "request", ctx)
    assert result == "ok"


def test_interceptor_reraises_exceptions():
    interceptor = RequestTimingInterceptor()
    ctx = _make_ctx("LaunchJob")

    def failing_handler(req, ctx):
        raise ValueError("boom")

    with pytest.raises(ValueError, match="boom"):
        interceptor.intercept_unary_sync(failing_handler, "request", ctx)
```
