Skip to content

Set min-slices and fix worker dashboard.#3222

Merged
rjpower merged 2 commits intomainfrom
rjpower/20260303-dashboard-cleanup
Mar 4, 2026
Merged

Set min-slices and fix worker dashboard.#3222
rjpower merged 2 commits intomainfrom
rjpower/20260303-dashboard-cleanup

Conversation

@rjpower
Copy link
Collaborator

@rjpower rjpower commented Mar 4, 2026

No description provided.

@claude
Copy link
Contributor

claude bot commented Mar 4, 2026

Claude finished @rjpower's task in 1m 32s —— View job


Review

This PR does three things: (1) bumps min_slices from 0 to 1 for two TPU scale groups in marin.yaml, (2) replaces the worker detail's inline Gauge-based utilization section with a grid-based panel using dedicated CSS classes and larger sparklines, and (3) fixes the Sparkline component to render with a single data point or all-zero data instead of returning null. Tests are updated accordingly and a new history-accumulation test is added. Changes look correct and well-structured.

Findings:

  • worker-detail.js:14ResourceSection and Gauge are imported but no longer used after this refactor. Dead imports.
  • test_dashboard.py — The import time inside test_worker_detail_sparklines_with_history (not inside a serialized worker function) is a local import that could be at the top of the file per AGENTS.md guidelines. The existing _hold_for_heartbeats local import is fine since it's serialized to a worker.

No correctness issues or behavioral regressions found.

@rjpower rjpower requested a review from yonromai March 4, 2026 03:47
@rjpower rjpower enabled auto-merge (squash) March 4, 2026 03:56
@rjpower
Copy link
Collaborator Author

rjpower commented Mar 4, 2026

@claude address PR feedback:

worker-detail.js:14 — ResourceSection and Gauge are imported but no longer used after this refactor. Dead imports.
test_dashboard.py — The import time inside test_worker_detail_sparklines_with_history (not inside a serialized worker function) is a local import that could be at the top of the file per AGENTS.md guidelines. The existing _hold_for_heartbeats local import is fine since it's serialized to a worker.

@claude
Copy link
Contributor

claude bot commented Mar 4, 2026

Claude finished @rjpower's task in 1m 39s —— View job


Addressing PR feedback

  • Remove dead imports (ResourceSection, Gauge) from worker-detail.js
  • Move import time to top of test_dashboard.py
  • Run pre-commit and tests
  • Commit and push

Both issues fixed in 92b43b8:

  1. worker-detail.js:14 — Removed unused ResourceSection and Gauge from the import.
  2. test_dashboard.py — Moved import time to the top of the file (lines 179 and 249 had local imports in test functions that aren't serialized to workers). The import time inside _hold_for_heartbeats and _allocate_memory_and_wait was left as-is since those are serialized to workers.

Testing: Pre-commit passed. Test collection for test_dashboard.py verified (e2e tests require Playwright + a running cluster so were not executed end-to-end in CI).

- Remove unused ResourceSection and Gauge imports from worker-detail.js
- Move import time to top of test_dashboard.py per coding guidelines

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Copy link
Contributor

@yonromai yonromai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, with one high-priority config regression to address before merge: the min_slices change in marin.yaml keeps five zoned TPU slices warm at idle. The dashboard and sparkline changes themselves look good, and the targeted worker-detail E2E tests passed locally.

Generated with Codex.

priority: 10
resources: { cpu: 112, ram: 192GB, disk: 100GB, tpu_count: 4, gpu_count: 0 }
min_slices: 0
min_slices: 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Priority: High

Setting min_slices: 1 here, and again for tpu_v6e_4, turns these zoned groups into five permanently warm 4-chip TPU slices, not one global warm pool. load_config() preserves explicit min_slices on every zone-expanded group, and the autoscaler enforces that floor even with zero demand, so this change stops these pools from scaling back to zero. If the goal is to reduce cold-start latency, this needs a single dedicated warm pool or some other global mechanism rather than min_slices on multi-zone groups.

Recommended fix: revert these min_slices changes, or replace them with a design that keeps at most one intentional warm slice instead of one per zone-expanded group.

Generated with Codex.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is okay and intentional, we want to keep slices warm.

@rjpower rjpower merged commit d0348a7 into main Mar 4, 2026
15 checks passed
@rjpower rjpower deleted the rjpower/20260303-dashboard-cleanup branch March 4, 2026 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants