feat(dashboard-api): add per-service resource metrics endpoint #810
Add GET /api/services/resources returning per-service CPU, RAM, and disk metrics. Container stats fetched from host agent with 20s cache, disk usage scanned locally with 60s cache. Includes container name reverse mapping and Docker Desktop memory caveat flag. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lightheartdevs
left a comment
Audit: APPROVE
Good architecture — parallel fetch with asyncio.to_thread and split cache TTLs (20s for container stats, 60s for disk) is well-designed. The container name reverse-mapping handles naming mismatches correctly.
Auth enforced, read-only. Clean.
Minor notes (non-blocking):
- _DATA_DIR_MAP is hardcoded, so new services require manual updates. Consider deriving it from the SERVICES dict or manifests.
- The 10s timeout on the host agent fetch could cause the entire endpoint to hang when the agent is down. The 20s cache mitigates this for subsequent requests.
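To make the first note concrete, here is a small sketch of how a directory-to-service map of this kind is typically consulted. It assumes unmapped directories fall back to their own name as the service ID; `service_for_dir` and `disk_by_service` are hypothetical helpers, and the only mapping confirmed by the PR text is models → llama-server.

```python
from typing import Dict

# Illustrative contents: the PR text confirms only models -> llama-server.
_DATA_DIR_MAP: Dict[str, str] = {"models": "llama-server"}


def service_for_dir(dir_name: str) -> str:
    """Map a data directory name to a service ID, falling back to the name itself."""
    return _DATA_DIR_MAP.get(dir_name, dir_name)


def disk_by_service(dir_sizes_gb: Dict[str, float]) -> Dict[str, float]:
    """Aggregate per-directory sizes (GB) into per-service totals."""
    totals: Dict[str, float] = {}
    for dir_name, size in dir_sizes_gb.items():
        sid = service_for_dir(dir_name)
        totals[sid] = round(totals.get(sid, 0.0) + size, 2)
    return totals


usage = disk_by_service({"models": 12.5, "open-webui": 0.25, "langfuse": 1.0})
```

Every new service whose data directory does not match its service ID needs a new entry in the literal dict, which is the maintenance cost the note points at.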
Depends on #804. Should merge after it.
Lightheartdevs
left a comment
Revised Audit: REQUEST CHANGES — fatal import crash
Withdrawing my earlier approval after deeper review.
BLOCKER: from helpers import dir_size_gb crashes without PR #804
dir_size_gb does not exist in helpers.py on main — it's a nested function inside main.py:_compute_storage(). PR #804 extracts it. If #810 merges first, the module-level import fails and the entire dashboard-api refuses to start — all endpoints go down, not just this one.
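One way to remove the hard failure, independent of merge order, would be a guarded import with a local fallback. This is only a sketch: the fallback body and its signature (a path in, a size in GB out) are assumptions, since the real dir_size_gb is not shown in this thread.

```python
import os

try:
    # Succeeds only once helpers.py exports dir_size_gb (the change attributed to #804).
    from helpers import dir_size_gb
except ImportError:

    def dir_size_gb(path: str) -> float:
        """Assumed fallback: sum file sizes under path and report them in GiB."""
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    # Skip files that vanish or are unreadable mid-scan.
                    continue
        return round(total / (1024 ** 3), 2)
```

With this shape, a missing helper degrades one endpoint's data source instead of preventing the module from importing at all. Enforcing the merge order is the simpler fix if the two PRs are guaranteed to land together.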
Soft dependency on #808: Without it, SERVICES[sid] lacks container_name. The fallback f"dream-{sid}" works for most services but breaks for open-webui (container is dream-webui, fallback generates dream-open-webui) and langfuse (dream-langfuse-web vs dream-langfuse). These services show null container stats even when running.
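The mismatch is easy to reproduce in isolation. The sketch below uses a hypothetical excerpt of SERVICES (the real dict has more entries and fields) and hypothetical helper names; the container names are the ones quoted in this comment.

```python
from typing import Dict

# Hypothetical excerpt of SERVICES, shaped as it would be once container_name is present.
SERVICES: Dict[str, dict] = {
    "open-webui": {"container_name": "dream-webui"},
    "langfuse": {"container_name": "dream-langfuse-web"},
    "llama-server": {},  # no explicit name: falls back to dream-<sid>
    "opencode": {"container_name": ""},  # empty name, as noted in this review
}


def container_name_for(sid: str) -> str:
    """Use the explicit container name when set, else the dream-<sid> convention."""
    return SERVICES[sid].get("container_name") or f"dream-{sid}"


def build_reverse_map() -> Dict[str, str]:
    """Map container name -> service ID for attributing host agent stats."""
    return {container_name_for(sid): sid for sid in SERVICES}


reverse = build_reverse_map()
fallback_only = {f"dream-{sid}": sid for sid in SERVICES}
```

`fallback_only` shows the behaviour without explicit names: it contains dream-open-webui and dream-langfuse, neither of which matches a running container, so stats for those two services are never attributed. Using `or` rather than a plain default also keeps an empty container_name from becoming an empty-string key.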
Other findings (non-blocking):
- _DATA_DIR_MAP is hardcoded and incomplete (~10 services missing); unmapped services fall back to dir name convention
- Cache thundering herd on simultaneous requests with expired TTL: two fetches dispatched, second write wins. Acceptable for dashboard.
- opencode has container_name: "", creating an empty-string key in the reverse map (cosmetic)
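If the duplicate fetch ever becomes a concern, a lock with a re-check after acquisition collapses concurrent refreshes into one. This is a sketch of that general technique, not something the PR implements; `SingleFlightCache` and `slow_loader` are hypothetical names.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional, Tuple


class SingleFlightCache:
    """TTL cache where concurrent callers share one refresh instead of racing."""

    def __init__(self, ttl: float, loader: Callable[[], Awaitable[Any]]):
        self._ttl = ttl
        self._loader = loader
        self._lock = asyncio.Lock()
        self._entry: Optional[Tuple[float, Any]] = None

    def _fresh(self) -> bool:
        return (
            self._entry is not None
            and time.monotonic() - self._entry[0] < self._ttl
        )

    async def get(self) -> Any:
        if self._fresh():
            return self._entry[1]
        async with self._lock:
            # Re-check: another caller may have refreshed while we waited.
            if self._fresh():
                return self._entry[1]
            value = await self._loader()
            self._entry = (time.monotonic(), value)
            return value


calls = 0


async def slow_loader() -> dict:
    # Hypothetical stand-in for the host agent fetch; counts how often it runs.
    global calls
    calls += 1
    await asyncio.sleep(0.01)
    return {"n": calls}


async def demo() -> list:
    cache = SingleFlightCache(ttl=20.0, loader=slow_loader)
    return await asyncio.gather(*(cache.get() for _ in range(5)))


results = asyncio.run(demo())
```

Five simultaneous callers trigger a single load here. For a dashboard polled by a handful of clients, the simpler last-write-wins behaviour described above is a reasonable trade.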
What's good: Auth enforced, no SSRF (URL from server config), no path traversal (child.name only), no info leaks (relative paths), bounded resource usage (always 2 threads max regardless of service count), cache design with split TTLs is well-thought-out.
Addressing review feedback
Re:
Re: container_name soft dependency. Acknowledged. Without #808,
Required merge order: #804 → #808 → #810
No code changes needed in this PR; all issues are addressed by the dependency chain.
Note: The Rust dashboard-api rewrite (#821) merged and deleted the Python files this PR modifies. This PR added a Python endpoint and router that no longer exist. The per-service resource metrics feature needs to be reimplemented as a Rust endpoint in the dashboard-api crate. Please rebase or rewrite against the current
What
- GET /api/services/resources returning per-service CPU, RAM, and disk metrics
- routers/resources.py with parallel data fetching
Why
How
- asyncio.to_thread for both host agent HTTP call (container stats) and local disk scan simultaneously
- container_name → service_id, correctly handling mismatched names (dream-webui → open-webui)
- _DATA_DIR_MAP maps data directory names to service IDs (e.g., models → llama-server)
Three Pillars Impact
- asyncio.to_thread (Python 3.9+, container is 3.11). Negligible resource usage
New Files
- dashboard-api/routers/resources.py (137 lines)
Modified Files
- dashboard-api/main.py: router import + registration (2 lines)
Testing
Automated
Manual
- GET /api/services/resources returns services with CPU/RAM/disk data
Review
Platform Impact
Sequence
PR 6 of 7 (Phase 3). Depends on PR 3 (#804 — host agent stats + dir_size_gb) and PR 5 (#808 — container_name in SERVICES)