
feat(extensions): install progress tracking with host agent orchestration #827

Open
yasinBursali wants to merge 2 commits into Light-Heart-Labs:main from yasinBursali:feat/extension-install-progress

@yasinBursali
Contributor

What

Add install progress tracking for extensions — the host agent writes stage-by-stage progress during install, dashboard-api serves it via a polling endpoint, and the catalog reflects real-time install status.

Why

When installing an extension, users see only a spinner for up to 5 minutes with no indication of what's happening. If Docker is pulling a large image on slow internet, there's no progress feedback. If it fails, the error is generic. This makes the install experience feel broken even when it's working.

How

  • Host agent: New /v1/extension/install endpoint returns 202 immediately, runs setup_hook → pull → start in a background thread. Writes atomic progress files (data/extension-progress/{service_id}.json) at each stage.
  • Dashboard-API: New GET /api/extensions/{service_id}/progress endpoint for frontend polling. _compute_extension_status() checks progress files first — returns "installing", "setting_up", or "error" based on active install state.
  • Combined install: Replaces the two-call pattern (setup_hook then start) with a single _call_agent_install() that handles the full lifecycle.
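As an illustration of the atomic progress writes described above, here is a minimal sketch rather than the PR's actual code. The function name `write_progress` and the `updated_at` and `error` fields are assumptions; only the `data/extension-progress/{service_id}.json` path and the `status` / `phase_label` keys come from this description.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional

# Directory named in the PR description; everything else here is illustrative.
PROGRESS_DIR = Path("data/extension-progress")


def write_progress(service_id: str, status: str, phase_label: str = "",
                   error: Optional[str] = None,
                   progress_dir: Path = PROGRESS_DIR) -> Path:
    """Write {service_id}.json atomically so a poller never reads a partial file."""
    progress_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "service_id": service_id,
        "status": status,
        "phase_label": phase_label,
        "error": error,               # hypothetical field
        "updated_at": time.time(),    # hypothetical field
    }
    target = progress_dir / f"{service_id}.json"
    # The temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=progress_dir, prefix=f".{service_id}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, target)  # atomic rename over any previous file
    except BaseException:
        # Do not leave a stray temp file behind if the write failed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return target
```

Readers either see the previous complete file or the new complete file, which is what makes plain file polling safe here.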

Architecture

Frontend → POST /install → Dashboard-API → POST /v1/extension/install → Host Agent
                                                    ↓ 202 (immediate)
                           ← 200 {progress_endpoint} ←
Frontend → GET /progress → Dashboard-API → reads data/extension-progress/{id}.json
                           ← {status: "pulling", phase_label: "Downloading image..."}

Security

| Concern | Mitigation |
| --- | --- |
| Auth bypass via background thread | Auth + validate_service_id + lock acquired BEFORE 202 response |
| Bearer token in Docker stderr | _BEARER_RE strips tokens before writing to progress file |
| Lock leak on client disconnect | try/except around json_response + Thread.start; lock.release() in except |
| Subprocess injection | List-form subprocess.run, no shell=True, service_id is regex-validated |
| Setup hook path traversal | _resolve_setup_hook() validates path containment via resolve().relative_to() |
| Progress file path traversal | service_id validated against ^[a-z0-9][a-z0-9_-]*$ before use in filename |
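Three of these mitigations are small enough to sketch. The service ID pattern is the one quoted above and `resolve().relative_to()` is the containment check named above. The bearer pattern is an assumption, since the real `_BEARER_RE` is not shown in this description, and the helper names `redact_bearer` and `resolve_inside` are hypothetical.

```python
import re
from pathlib import Path

# Pattern quoted in the PR description.
SERVICE_ID_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")
# ASSUMED pattern; the PR's actual _BEARER_RE is not shown.
BEARER_RE = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")


def validate_service_id(service_id: str) -> str:
    """Reject anything that could escape the progress directory or carry shell metacharacters."""
    # fullmatch avoids the case where "$" accepts a trailing newline.
    if not SERVICE_ID_RE.fullmatch(service_id):
        raise ValueError(f"invalid service_id: {service_id!r}")
    return service_id


def redact_bearer(text: str) -> str:
    """Strip bearer tokens from captured stderr before it is written anywhere."""
    return BEARER_RE.sub("Bearer [REDACTED]", text)


def resolve_inside(base_dir: Path, relative: str) -> Path:
    """Resolve a path and confirm it stays inside base_dir (raises ValueError otherwise)."""
    base = base_dir.resolve()
    candidate = (base / relative).resolve()
    candidate.relative_to(base)  # ValueError if candidate is outside base
    return candidate
```

As noted under Known Tech Debt, a token-shaped pattern like this one would not catch credentials embedded in a registry URL.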

Test Coverage

10 new tests in TestInstallProgress:

  • Progress endpoint: idle when no file, returns active data
  • Status mapping: pulling→installing, starting→installing, setup_hook→setting_up, error→error
  • Staleness: old progress ignored, stale error preserved
  • Cleanup: removes old "started" files
  • Install response: includes progress_endpoint field

All tests: 80 passed, 0 failed
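The status mapping and staleness rules covered by these tests could look roughly like the following. This is a sketch under assumptions, not the body of `_compute_extension_status()`: the `updated_at` field, the function name `status_from_progress`, and reusing the 15 minute window from the manual test steps as the staleness threshold are all illustrative choices.

```python
import time
from typing import Optional

# ASSUMED threshold: 15 minutes is mentioned for cleanup in the manual test steps.
STALE_AFTER_SECONDS = 15 * 60

# Mapping listed under Test Coverage.
_STATUS_MAP = {
    "pulling": "installing",
    "starting": "installing",
    "setup_hook": "setting_up",
    "error": "error",
}


def status_from_progress(progress: Optional[dict],
                         now: Optional[float] = None) -> Optional[str]:
    """Map a progress record to a catalog status, or None to fall back to other checks."""
    if not progress:
        return None
    mapped = _STATUS_MAP.get(progress.get("status"))
    if mapped is None:
        return None
    if mapped == "error":
        # A stale error is preserved so the failure stays visible to the user.
        return "error"
    now = time.time() if now is None else now
    age = now - float(progress.get("updated_at", 0))  # hypothetical timestamp field
    if age > STALE_AFTER_SECONDS:
        return None  # old in-flight progress is ignored
    return mapped
```

Returning None rather than a default status lets the caller fall through to the health-based status introduced in #825.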

Three Pillars Impact

  • Install Reliability: Improved — users see progress, errors are visible with Docker stderr
  • Broad Compatibility: 600s timeout (existing SUBPROCESS_TIMEOUT_START), atomic file writes, all 3 platforms
  • Extension Coherence: Only affects install path. dream enable/disable unchanged.

Platform Impact

  • macOS (Apple Silicon): Supported — POSIX atomic rename, same data/ volume
  • Linux: Supported — identical code paths
  • Windows (WSL2): Supported — atomic rename on ext4

Manual Test Steps (all platforms)

  • Install extension → host agent returns 202, progress file created
  • Poll progress endpoint during install → see "pulling", "starting" transitions
  • Catalog shows "installing" during active install
  • Failed install → progress file shows "error" with Docker stderr
  • Pull failure (offline) → proceeds to start with cached image
  • Concurrent install on same service → 409 rejection
  • Stale "started" progress file → cleaned up after 15 min on catalog fetch
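For the last step, the stale-file cleanup might be sketched as below. It assumes age is judged by file modification time and that only files whose status is "started" are removed; the function name `cleanup_stale_progress` and both of those choices are illustrative, not taken from the diff.

```python
import json
import time
from pathlib import Path
from typing import List, Optional


def cleanup_stale_progress(progress_dir: Path,
                           max_age_seconds: float = 15 * 60,
                           now: Optional[float] = None) -> List[str]:
    """Delete "started" progress files older than max_age_seconds; return removed names."""
    now = time.time() if now is None else now
    removed: List[str] = []
    for path in sorted(progress_dir.glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            age = now - path.stat().st_mtime  # ASSUMPTION: age is based on mtime
        except (OSError, ValueError):
            continue  # unreadable or malformed file: leave it alone
        if isinstance(data, dict) and data.get("status") == "started" and age > max_age_seconds:
            path.unlink(missing_ok=True)  # tolerate a concurrent delete
            removed.append(path.name)
    return removed
```

Error records are deliberately left in place here, matching the "stale error preserved" behaviour listed under Test Coverage.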

Known Tech Debt (follow-up)

  • _handle_setup_hook() has its own inline manifest parsing — should be refactored to use _resolve_setup_hook() shared helper
  • Bearer regex doesn't cover URL-embedded credentials — edge case for private registries

Sequence

PR 2 of 3 — Extension Lifecycle States

Note: This PR is stacked on #825. Merge #825 first, then this PR will rebase cleanly.

🤖 Generated with Claude Code

yasinBursali and others added 2 commits April 6, 2026 16:06
…r stopped services

User extensions now get real health checking instead of file-based status.
"stopped" status (red badge) replaces the misleading "enabled" for crashed
containers. Start button allows restarting stopped extensions without the
disable-enable workaround.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

Host agent writes stage-by-stage progress files during extension install
(setup_hook → pulling → starting → done/error). Dashboard-API serves
progress via polling endpoint and reflects install status in catalog.
Replaces two-call install pattern with single 202+background-thread endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@Lightheartdevs left a comment

Audit: REQUEST CHANGES — CI failing + depends on #825 + merge conflict

The architecture is solid — background thread install with atomic progress files, 202 immediate response, stage-by-stage tracking (setup_hook → pull → start). Good UX improvement over the current spinner-only experience.

Security looks good from the description:

  • Auth enforced before side effects
  • Service IDs regex-validated
  • subprocess uses list args (no shell=True)
  • Progress files are atomic writes

Blockers:

  1. CI failing: Ruff lint and API test failures. The unused imports and the failing tests need to be fixed.

  2. Depends on #825: Shares user_extensions.py and health-based status code. #825 needs to merge first, but #825 itself has a merge conflict from the liquid metal migration (#829).

  3. Merge conflict with main: Extensions.jsx was migrated to theme variables in #829. Any frontend changes need to use bg-theme-card, text-theme-text-muted, text-theme-accent-light, etc. instead of hardcoded zinc/indigo. See the mapping in my comment on #825.

Merge order: #825 (rebased) → #827 (rebased) → #828

Question: The frontend timeout (300s per the description) vs backend subprocess timeout (600s) — if the frontend gives up at 300s but the backend is still pulling a 44GB image, what happens? Does the progress file stay in "pulling" state forever? Is there cleanup?
