fix(extensions): health-based status for user extensions + restart for stopped services #825

Open
yasinBursali wants to merge 2 commits into Light-Heart-Labs:main from yasinBursali:fix/extension-status-health-check

@yasinBursali
Contributor

What

Fix user extension status reporting and add restart capability for stopped extensions.

Why

User extensions compute status from file existence (compose.yaml present = "enabled"), not container health. A crashed container still shows "enabled" with a green badge — misleading users into thinking the service is running. This erodes trust in the Extensions portal.

How

  • Health-based status: User extensions now get the same HTTP health checking as core services via a new user_extensions.py scanner module with 30s TTL cache
  • "stopped" status: When compose.yaml exists but the container is unhealthy/down, status is "stopped" (red badge) instead of the previous false "enabled"
  • Start button: Stopped extensions get a "Start" button that restarts the container via the existing host agent start mechanism
  • Enable-when-stopped: enable_extension() now handles two cases — disabled (rename + start) and stopped (scan + start) — both under the extensions lock with symlink guards
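
The three-way status described above can be sketched as follows. This is a minimal illustration, not the PR's `_compute_extension_status()`: the function name `sketch_status` and its arguments are hypothetical, and it assumes only what the description states, namely that a present compose.yaml plus a passing health check means "enabled", a present compose.yaml plus a failing check means "stopped", and anything else is treated as "disabled".

```python
from pathlib import Path


def sketch_status(ext_dir: Path, healthy: bool) -> str:
    """Hypothetical helper: map file presence + health result to a status.

    `healthy` is assumed to be the outcome of an HTTP health check that
    the caller has already performed.
    """
    if (ext_dir / "compose.yaml").exists():
        # File presence alone used to mean "enabled"; health now decides.
        return "enabled" if healthy else "stopped"
    # No active compose file: the extension is not enabled at all.
    return "disabled"
```

The point of the sketch is that file presence is now a precondition rather than the answer: a crashed container falls into the "stopped" branch instead of being reported as "enabled".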

Three Pillars Impact

  • Install Reliability: No changes to install flow, env vars, or compose files
  • Broad Compatibility: 30s TTL cache, negligible resource impact, works identically on Linux, macOS (Apple Silicon), and Windows (WSL2)
  • Extension Coherence: dream enable/dream disable unchanged, mod pattern preserved, resolve-compose-stack.sh unaffected

Security

| Concern | Mitigation |
|---|---|
| SSRF via manifest host | Health check host derived from `service_id` (Docker DNS), NEVER from manifest `default_host`/`host_env`. Regex-validated `[a-z0-9][a-z0-9_-]*` |
| URL injection via health path | Validated against `^/[A-Za-z0-9/_\-.]*$` + explicit reject of `..`, `@`, `?`, `#`, scheme prefixes |
| Symlink escape | Scanner skips symlinked directories. All mutation paths (stopped-enable, disabled-enable, disable) check `os.lstat` + `S_ISLNK` |
| YAML deserialization | `yaml.safe_load()` everywhere; no unsafe `yaml.load()` |
| Malformed manifest DoS | Per-item try/except (`TypeError`, `ValueError`) prevents one broken manifest from crashing catalog for all users |
| Race condition (TOCTOU) | Stopped-enable path holds `_extensions_lock()` around symlink check + compose scan. Agent call outside lock (matches codebase pattern) |
| Compose content validation | `_scan_compose_content()` runs on stopped-enable path (not just disabled-enable), preventing bypass of security scanner |
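
To make the first two mitigations concrete, here is a hedged sketch of how a health URL could be built under those rules. The two regexes and the reject list are taken from the table; the function names, the `port` parameter, and the use of `://` as the check for scheme prefixes are assumptions for illustration and may differ from what user_extensions.py does.

```python
import re
from typing import Optional

# Patterns quoted from the Security table.
_SERVICE_ID_RE = re.compile(r"[a-z0-9][a-z0-9_-]*")
_HEALTH_PATH_RE = re.compile(r"^/[A-Za-z0-9/_\-.]*$")
# Explicit reject list; "://" stands in for "scheme prefixes" (assumption).
_REJECTED_TOKENS = ("..", "@", "?", "#", "://")


def is_safe_health_path(path: object) -> bool:
    """Return True only for plain absolute paths such as "/health"."""
    if not isinstance(path, str):
        return False
    # fullmatch avoids the case where "$" matches before a trailing newline.
    if not _HEALTH_PATH_RE.fullmatch(path):
        return False
    return not any(token in path for token in _REJECTED_TOKENS)


def build_health_url(service_id: str, port: int, health_path: str) -> Optional[str]:
    """Hypothetical helper: the host is always service_id, never manifest input."""
    if not isinstance(service_id, str) or not _SERVICE_ID_RE.fullmatch(service_id):
        return None
    if not isinstance(port, int) or not 1 <= port <= 65535:
        return None
    if not is_safe_health_path(health_path):
        return None
    return f"http://{service_id}:{port}{health_path}"
```

Note that the path regex alone already excludes `@`, `?`, `#`, and `:`; the explicit reject list matters mainly for `..`, which the character class would otherwise allow.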

Test Coverage

New: test_user_extensions.py — 14 tests

  • Scanner: empty dir, nonexistent dir, enabled ext, disabled skipped, missing manifest, no health endpoint, symlink skipped
  • Security: 6 bad health paths rejected (traversal, authority injection, query/fragment, schemes), host always equals service_id, name fallback
  • Cache: TTL returns cached, TTL expires rescans, reset clears

New/Updated: test_extensions.py — 9 new/updated tests

  • Lifecycle: healthy→enabled, unhealthy→stopped, disabled unchanged, core service unchanged, stopped count in summary
  • Enable-when-stopped: starts without rename (200, was 409), symlink rejected (400)
  • Updated: file-based "enabled" → "stopped" for unhealthy containers

Frontend

  • vite build: clean
  • eslint: 0 errors (5 pre-existing warnings in unrelated files)

Manual test checklist (all platforms)

  • Stop user extension container → status changes to "stopped" (red badge) on refresh
  • Click "Start" on stopped extension → container restarts, status returns to "enabled"
  • Disable/enable flow unchanged
  • Extension without manifest → silently skipped, no 500
  • Filter by "Stopped" → shows only stopped extensions
  • Summary bar shows correct "Stopped" count

New Files

| File | Purpose |
|---|---|
| extensions/services/dashboard-api/user_extensions.py | Dynamic manifest scanner with SSRF-safe host derivation, health path validation, thread-safe TTL cache |
| extensions/services/dashboard-api/tests/test_user_extensions.py | 14 tests for scanner + cache |

Modified Files

| File | Changes |
|---|---|
| extensions/services/dashboard-api/routers/extensions.py | `_compute_extension_status()` health-based for user extensions; catalog/detail include user ext health via `asyncio.gather`; `enable_extension()` handles stopped case with lock + symlink guard + compose scan |
| extensions/services/dashboard-api/tests/test_extensions.py | 7 new tests + 2 updated for lifecycle status and enable-when-stopped |
| extensions/services/dashboard/src/pages/Extensions.jsx | Stopped badge (red), Start button, STATUS_FILTERS/LABELS, summary counter |

Platform Impact

  • macOS (Apple Silicon): Supported — health checks via Docker internal DNS, pathlib for all file ops
  • Linux: Supported — identical code paths
  • Windows (WSL2): Supported — identical code paths, os.rename atomic on ext4

Sequence

PR 1 of 3 — Extension Lifecycle States

  • PR 1: Fix user extension status + restart (this PR)
  • PR 2: Install progress tracking (host agent + backend) — depends on this PR
  • PR 3: Frontend progress UI + error display — depends on PR 1 + PR 2

Note for PR 2: PRs modifying dream-host-agent.py (#804, #806) have now merged. PR 2 will rebase cleanly.

🤖 Generated with Claude Code

Collaborator

@Lightheartdevs Lightheartdevs left a comment

Audit: REQUEST CHANGES — CI failing, good PR otherwise

This fixes a real UX bug — crashed/stopped extensions showing false "enabled" (green) status. The health-based status approach is correct and the security model is sound.

Security verified:

  • SSRF mitigated: health-check hosts derived from directory names, never from manifest input ✓
  • Health paths validated with regex + reject list (no .., @, ?, #, protocol prefixes) ✓
  • Symlink checks on stopped-enable path under file lock ✓
  • Compose content re-scanned before restart ✓
  • Auth enforced on all endpoints ✓
  • No shell injection (agent call via HTTP + JSON, agent uses list args) ✓
  • YAML safe_load only ✓

2 CI failures need fixing:

  1. Ruff lint: Unused import os and MagicMock in test_extensions.py — trivial, just remove them.

  2. 2 cache tests failing: TestCaching::test_cache_returns_same_result and TestCaching::test_reset_cache both get AssertionError: assert 'my-ext' in {}. The scan function works correctly in other tests (TestScanUserExtensions all pass), so this appears to be a cache implementation or test isolation bug. The first call to get_user_services_cached() returns empty when it should find the extension.

Minor notes (non-blocking):

  • The user extension health-check block is duplicated identically in extensions_catalog() and extension_detail() — consider extracting to a helper
  • If a health check times out, the extension shows "stopped" even if the container is running — acceptable trade-off vs the old false "enabled", but worth documenting
  • The stopped-enable path calls _call_agent outside the lock — matches existing pattern, intentional
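
For the first note, one possible shape for a shared helper is sketched below. It is illustrative only: `gather_user_extension_health`, the `check` callback, and the 5 second default timeout are hypothetical names and values, not taken from the PR. It also shows the trade-off from the second note, where a timeout or error is reported as unhealthy and therefore surfaces as "stopped".

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def gather_user_extension_health(
    services: Dict[str, dict],
    check: Callable[[str, dict], Awaitable[bool]],
    timeout: float = 5.0,
) -> Dict[str, bool]:
    """Run one health check per user extension concurrently.

    `services` maps service_id to its config; `check` is whatever coroutine
    performs the HTTP probe (assumed to exist in the caller's code).
    """

    async def _one(service_id: str, cfg: dict) -> bool:
        try:
            return await asyncio.wait_for(check(service_id, cfg), timeout)
        except Exception:
            # Timeouts and probe errors are treated as unhealthy, so a slow
            # but running container would be shown as "stopped".
            return False

    ids = list(services)
    results = await asyncio.gather(*(_one(sid, services[sid]) for sid in ids))
    return dict(zip(ids, results))
```

Both extensions_catalog() and extension_detail() could then await a single helper of this kind instead of carrying their own copy of the block.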

yasinBursali and others added 2 commits April 6, 2026 16:36
…r stopped services

User extensions now get real health checking instead of file-based status.
"stopped" status (red badge) replaces the misleading "enabled" for crashed
containers. Start button allows restarting stopped extensions without the
disable-enable workaround.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unused os and MagicMock imports (ruff F401)
- Use float('-inf') as cache timestamp sentinel instead of 0.0 to
  prevent false cache hits on fresh CI VMs where time.monotonic()
  may be smaller than the TTL value

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yasinBursali yasinBursali force-pushed the fix/extension-status-health-check branch from d9d96ec to fc6806e Compare April 6, 2026 13:44
@yasinBursali
Contributor Author

Both CI failures fixed:

  1. Ruff lint — Removed unused os and MagicMock imports from test_extensions.py.

  2. Cache tests — Root cause: _cache["timestamp"] was initialized to 0.0. On fresh CI VMs, time.monotonic() can be smaller than the TTL (e.g. <60s since boot), so now - 0.0 < ttl returned True — the cache returned its initial empty {} without ever scanning. Fixed by using float("-inf") as the sentinel, which guarantees now - (-inf) = inf > any ttl.
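
The root cause can be reproduced in isolation with a small stand-in cache. This is a sketch under assumptions, not the code in user_extensions.py: the class name `SketchTTLCache`, the injectable `clock`, and the `initial_timestamp` parameter exist only to make the two sentinels comparable side by side.

```python
import threading
import time
from typing import Callable, Dict


class SketchTTLCache:
    """Hypothetical thread-safe TTL cache around a scan function."""

    def __init__(
        self,
        scan: Callable[[], Dict[str, dict]],
        ttl: float = 30.0,
        initial_timestamp: float = float("-inf"),
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._scan = scan
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._data: Dict[str, dict] = {}
        self._timestamp = initial_timestamp

    def get(self) -> Dict[str, dict]:
        with self._lock:
            now = self._clock()
            # With a 0.0 sentinel and now < ttl, this is a false cache hit.
            if now - self._timestamp < self._ttl:
                return self._data
            self._data = self._scan()
            self._timestamp = now
            return self._data

    def reset(self) -> None:
        with self._lock:
            self._data = {}
            self._timestamp = float("-inf")


def scan() -> Dict[str, dict]:
    return {"my-ext": {}}


# Simulate a VM where the monotonic clock reads 12 seconds.
buggy = SketchTTLCache(scan, initial_timestamp=0.0, clock=lambda: 12.0)
fixed = SketchTTLCache(scan, clock=lambda: 12.0)
print(buggy.get())  # {} because 12.0 - 0.0 < 30.0, so the scan never runs
print(fixed.get())  # {'my-ext': {}} because 12.0 - (-inf) is inf
```

With the `float("-inf")` sentinel, the first call always scans regardless of how small the clock value is, which is the behaviour the failing tests expected.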

Rebased on latest main, all checks green.

@Lightheartdevs
Collaborator

Merge conflict with main: Extensions.jsx was updated by #829 (liquid metal design migration), which replaced all hardcoded zinc/indigo colors with theme variables. Your branch's Extensions.jsx changes will conflict with the new class names.

When rebasing, the key mapping is:

  • bg-zinc-* → bg-theme-card / bg-theme-bg
  • text-zinc-400/500/600 → text-theme-text-muted
  • text-white → text-theme-text
  • border-zinc-* → border-theme-border
  • bg-indigo-500 → bg-theme-accent
  • text-indigo-* → text-theme-accent-light

For any NEW UI elements you're adding (stopped badge, start button), please use theme variables instead of hardcoded colors. Semantic status colors (green/red/orange/amber) stay as-is.

CI was all green before the conflict — once rebased this should be ready to merge.

2 participants