Phase 27: Watcher Service & User-Initiated Scan #59
Merged
SimplicityGuy merged 87 commits on May 14, 2026
Phase 27 UI-SPEC.md: visual + interaction contracts for the Trigger Scan card, Scan Path Picker HTMX swap, Scan Progress poll partial, Recent Scans mini-table, and shared scan status pill. Inherits all design tokens (color, type, spacing) from the existing Phaze design system in templates/base.html and established class patterns; introduces no new visual primitives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Establishes the technical foundation for Phase 27 by mapping each CONTEXT.md decision to existing codebase patterns (PhazeAgentClient, AgentTaskRouter, Phase 26 D-08 cross-tenant guard, HTMX poll-halt) and surfacing the watchdog-asyncio thread bridge as the central pattern. Documents mtime-stability landmines across rsync/cp/wget/editor write patterns plus the Validation Architecture for Nyquist Dimension 8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
STATE.md frontmatter said milestone: v3.0 but milestone_name was already "Distributed Agents" (the v4.0 milestone). v3.0 (Cross-Service Intelligence & File Enrichment) shipped 2026-04-04. Fix the version to match the active milestone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add watchdog>=4.0 to [project].dependencies (resolves to 6.0.0)
- Extend AgentSettings with four new fields per D-03 / D-11: watcher_settle_seconds (10), watcher_max_pending_seconds (3600), watcher_sweep_interval_seconds (2), scan_chunk_size (500)
- Each field wired via AliasChoices(PHAZE_WATCHER_* / PHAZE_SCAN_CHUNK_SIZE)
- Add parametrized defaults + env-var alias tests in test_config_role_split.py
Wave 0 foundation for Phase 27 watcher service.
- New phaze.tasks._shared.agent_bootstrap module exports _WHOAMI_BACKOFF_S, construct_agent_client, whoami_with_retry
- Refactor agent_worker.py to import from _shared (preserves _whoami_with_retry back-compat alias so internal call sites are unchanged)
- Tighten whoami_with_retry: short-circuit on AgentApiAuthError (Pitfall 7) with operator-actionable "auth invalid; check PHAZE_AGENT_TOKEN" hint and ERROR-level log; no backoff entries consumed before short-circuit
- T-27-04 mitigation: cleartext token never escapes construct_agent_client
- Update test_agent_startup_banner.py to patch construct_agent_client and the shared module's _WHOAMI_BACKOFF_S (test target moved with the function)
- Add 5 new tests in test_shared_agent_bootstrap.py covering all 4 behaviors + token-leak guard
D-17 import-boundary invariant: shared module imports only phaze.config, phaze.services.agent_client, phaze.schemas.agent_identity (no Postgres stack).
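The auth short-circuit described in this commit can be illustrated with a minimal sketch. The exception classes, the `FakeClient`, and the backoff values below are hypothetical stand-ins, not the project's actual code; only the idea (retry transient failures, fail fast on a rejected token) comes from the commit message.

```python
import asyncio
import logging

logger = logging.getLogger("phaze.sketch.bootstrap")


class AgentApiError(Exception):
    """Stand-in base class for agent API failures (hypothetical)."""


class AgentApiAuthError(AgentApiError):
    """Stand-in for a rejected bearer token (hypothetical)."""


# Illustrative backoff schedule in seconds; the real values are not in the source.
WHOAMI_BACKOFF_S = (0.0, 0.01, 0.02)


async def whoami_with_retry(client, backoff=WHOAMI_BACKOFF_S):
    """Call client.whoami(), retrying transient errors but never auth errors."""
    last_error = None
    for delay in backoff:
        try:
            return await client.whoami()
        except AgentApiAuthError:
            # A bad token will not fix itself, so skip the remaining backoff.
            logger.error("auth invalid; check PHAZE_AGENT_TOKEN")
            raise
        except AgentApiError as exc:
            last_error = exc
            await asyncio.sleep(delay)
    if last_error is not None:
        raise last_error
    raise RuntimeError("empty backoff schedule")


class FakeClient:
    """Test double that raises the queued errors, then succeeds."""

    def __init__(self, errors):
        self.errors = list(errors)
        self.calls = 0

    async def whoami(self):
        self.calls += 1
        if self.errors:
            raise self.errors.pop(0)
        return {"agent_id": "dev-agent"}
```

With `FakeClient([AgentApiError("boom")])` the helper succeeds on the second call; with an `AgentApiAuthError` queued first it re-raises after exactly one call.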
- New tests/test_agent_watcher/ package marker + conftest with three
fixtures (tmp_watcher_root, fake_clock, mock_api_client) so Plan 05 can
write test_debouncer/test_observer/test_main with zero scaffolding cost
- Extend tests/test_task_split.py with two new subprocess-isolated cases:
* test_agent_watcher_does_not_import_phaze_database -- Phase 27 D-22 +
Pitfall 5 extension; forbidden tuple adds phaze.tasks.agent_worker
(watcher uses asyncio.run, not SAQ; dragging in agent_worker would
require PHAZE_AGENT_QUEUE). Skipped pre-Plan-05 via importlib.util
find_spec predicate; becomes a hard gate when Plan 05 creates the
phaze.agent_watcher package.
* test_shared_bootstrap_stays_postgres_free -- enforces D-17 invariant
on phaze.tasks._shared.agent_bootstrap; passes immediately (no DB stack
in the import graph).
Existing agent_worker subprocess case continues to pass (no regression).
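A subprocess-isolated import-boundary check of this kind might look roughly like the sketch below. The helper name `leaked_modules` is hypothetical, and stdlib modules stand in for the phaze packages so the example runs anywhere; the real tests live in tests/test_task_split.py and may differ.

```python
import json
import subprocess
import sys


def leaked_modules(target: str, forbidden: tuple[str, ...]) -> list[str]:
    """Import `target` in a fresh interpreter and list forbidden modules it loaded."""
    code = (
        "import importlib, json, sys\n"
        f"importlib.import_module({target!r})\n"
        f"forbidden = {forbidden!r}\n"
        "print(json.dumps([m for m in forbidden if m in sys.modules]))\n"
    )
    # A fresh process guarantees sys.modules is not polluted by earlier tests.
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

For instance, `leaked_modules("logging.config", ("logging.handlers",))` reports the transitive import, which is the same shape of failure the boundary tests guard against for phaze.database.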
- 3 tasks complete (watchdog dep + AgentSettings knobs; shared agent bootstrap with Pitfall 7 short-circuit; test scaffolding + boundary tests)
- 14 plan-scoped tests pass + 1 conditional skip (waits for Plan 05)
- No regressions in existing test suite; all quality gates green
fileConfig() default disable_existing_loggers=True silently kills every Python logger not listed in alembic.ini (only root/sqlalchemy/alembic). After any test in tests/test_migrations/ runs, all phaze.* loggers are disabled and pytest caplog cannot capture from them, which surfaced as test_whoami_with_retry_short_circuits_on_auth_error failing only in the full suite, never in isolation.
Why: same-process test pollution from migration tests breaks caplog for any subsequent test that asserts on a logged message. Setting disable_existing_loggers=False keeps the alembic logger config additive, which is the canonical Alembic recommendation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
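The effect of `disable_existing_loggers` can be reproduced with the standard library alone. This is a minimal sketch: the ini content and the logger name `phaze.sketch.app` are illustrative, not the project's alembic.ini. Note that `fileConfig` mutates global logging state (it resets root handlers and level).

```python
import configparser
import logging
import logging.config

# Smallest config fileConfig accepts: a root logger, no handlers, no formatters.
MINIMAL_INI = """
[loggers]
keys=root

[handlers]
keys=

[formatters]
keys=

[logger_root]
level=WARNING
handlers=
"""


def app_logger_disabled_after_fileconfig(disable_existing: bool) -> bool:
    """Return whether a pre-existing logger ends up disabled after fileConfig()."""
    app_logger = logging.getLogger("phaze.sketch.app")
    app_logger.disabled = False
    parser = configparser.ConfigParser()
    parser.read_string(MINIMAL_INI)
    # fileConfig accepts a ready-made parser as well as a file name.
    logging.config.fileConfig(parser, disable_existing_loggers=disable_existing)
    return app_logger.disabled
```

Calling the helper with `True` reports the logger as disabled, which is the state caplog cannot capture from; calling it with `False` leaves the logger enabled.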
- Extend FileUpsertChunk with batch_id: uuid.UUID | None = None
- When present, binds chunk to specific ScanBatch (Plan 03 wires resolver)
- When absent, controller resolves agent's LIVE sentinel batch
- Drop unused `from __future__ import annotations` (pydantic needs uuid at runtime to build the validator; bare runtime import matches sibling schemas)
- All 5 behaviors covered: default None, explicit UUID, non-UUID rejection, extra="forbid" preserved for unknown fields, JSON schema exposes uuid|null
- Phase 25/26 callers continue to validate (additive optional field)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New module phaze.schemas.agent_scan_batches:
* ScanBatchPatch (request body): four optional fields
(total_files, processed_files, status, error_message) with
extra="forbid"; status is Literal["running","completed","failed"];
"live" is intentionally absent (D-10 schema-layer guard on the
watcher's terminal sentinel state)
* ScanBatchPatchResponse (echo body): full row echo per D-Discretion §4,
which saves the agent a follow-up GET; loose `status: str` mirrors the
sibling ExecutionLogPatchResponse shape
- 9 tests cover: running/completed/failed acceptance, live + garbage
rejection, optional-progress-counts, no-ge-on-ints, extra=forbid,
empty-body validates, response-row-echo, JSON schema Literal alts
exclude "live"
- Module is contract-only; Plan 03 wires the endpoint and cross-tenant
guard (T-27-02 mitigation lives at the router layer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, D-06)
- Append ScanDirectoryPayload to phaze.schemas.agent_tasks (after
ScanLiveSetPayload): three fields (scan_path, batch_id: uuid.UUID,
agent_id) with extra="forbid". Carries the per-job snapshot for the
agent's scan_directory SAQ task (D-23 invariant: agent never reads
state back from the controller mid-job).
- New module phaze.schemas.pipeline_scans:
* TriggerScanForm: operator-submitted form body for POST
/pipeline/scans. Three fields (agent_id, scan_root, subpath=""),
extra="forbid". Semantic validation (NFC + prefix + .. rejection)
happens at the router layer (T-27-03 disposition).
- 9 new tests: 5 for ScanDirectoryPayload (minimal valid, non-UUID
rejection, extra-forbid, field-set, no-models/current-path) + 4 for
TriggerScanForm (default-empty subpath, explicit subpath, extra-forbid,
required-fields). Existing invariant tests extended to include the new
ScanDirectoryPayload class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks shipped:
- FileUpsertChunk.batch_id optional field (D-09)
- ScanBatchPatch + ScanBatchPatchResponse module (D-10; LIVE excluded at schema layer)
- ScanDirectoryPayload (D-14) + TriggerScanForm (D-06)
45 schema tests pass; 11 Phase 25/26 router tests pass (no regression); ruff + ruff-format + mypy all clean. Four auto-fixes documented inline (ruff TC003 false-positive resolved by dropping __future__ annotations in agent_files.py; docstring grep collision adjusted; ruff I001 auto-fixed import order; JSON-schema LIVE-exclusion promoted to standalone test for D-10 regression coverage).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…client method (D-10, T-27-01)
- New router phaze.routers.agent_scan_batches with handler ordering:
404 (unknown) -> 403 (cross-tenant, T-27-01) -> 422 (status='live' via
Literal) -> 200 idempotent same-state echo -> 409 (illegal transition) ->
200 applied. Cross-tenant guard mirrors agent_proposals.py:62-76
byte-for-byte so a leaked batch_id cannot be probed via 409 timing.
- _SCAN_TRANSITIONS dict gates RUNNING -> {COMPLETED, FAILED}; LIVE
intentionally absent (watcher's terminal sentinel).
- Same-state PATCH with no other set fields is a zero-DB-write echo --
matches Phase 26 D-08 invariant (no updated_at bump).
- PhazeAgentClient.patch_scan_batch inherits the tenacity retry funnel +
AgentApiError hierarchy via _request; sends model_dump(exclude_unset=True)
so default-None fields don't clobber server-side state.
- 11 router contract tests cover all branches; Test 9 specifically asserts
403 (not 409) when agent B PATCHes agent A's COMPLETED batch -- proves
the cross-tenant check runs BEFORE state-machine eval.
- 1 respx client test verifies URL, exclude_unset wire body, and response
model validation.
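The check ordering listed in this commit can be sketched as a pure function over an in-memory dict. `Batch`, `patch_status_code`, and the dict-based lookup are hypothetical simplifications; the real handler works against the database and returns response bodies, and the order below simply follows the sequence given in the commit message.

```python
from dataclasses import dataclass

# Legal transitions; "live" is deliberately neither a source nor a target.
SCAN_TRANSITIONS = {"running": {"completed", "failed"}}
PATCHABLE_STATUSES = {"running", "completed", "failed"}


@dataclass
class Batch:
    agent_id: str
    status: str


def patch_status_code(batches, batch_id, caller_agent_id, new_status):
    """Return the HTTP status a status-only PATCH would produce."""
    batch = batches.get(batch_id)
    if batch is None:
        return 404  # unknown batch
    if batch.agent_id != caller_agent_id:
        return 403  # cross-tenant guard runs before any state-machine logic
    if new_status not in PATCHABLE_STATUSES:
        return 422  # e.g. "live" is rejected at the schema layer
    if new_status == batch.status:
        return 200  # idempotent same-state echo, nothing written
    if new_status not in SCAN_TRANSITIONS.get(batch.status, set()):
        return 409  # illegal transition
    batch.status = new_status
    return 200  # transition applied
```

Because the tenant check precedes the transition check, a caller probing another agent's COMPLETED batch sees 403 rather than 409, which is the property Test 9 asserts.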
…9/D-18/D-21, T-27-02)
- Insert a resolution block BEFORE the records loop in agent_files.upsert_files:
* body.batch_id present -> session.get(ScanBatch, id); 404 if missing;
403 if batch.agent_id != caller.id (T-27-02). Mirrors the Phase 26
D-08 cross-tenant guard placement byte-for-byte.
* body.batch_id absent -> SELECT ScanBatch.id WHERE agent_id=? AND
status='live'. The Phase 24 partial UQ uq_scan_batches_agent_id_live
guarantees exactly one row exists for any registered agent, so
.scalar_one() is safe.
- Stamp `data["batch_id"] = resolved_batch_id` on every record alongside the
existing AUTH-01 `agent_id` stamp; the existing upsert SET clause already
copies excluded.batch_id, so the field flows through atomically.
- Auto-enqueue path is untouched -- SCAN-02 invariant preserved (Test 5
verifies extract_file_metadata still fires for new INSERTs).
- 5 new contract tests cover all branches; Test 3 explicitly asserts ZERO
FileRecord rows insert when a cross-tenant 403 fires (atomicity proof,
T-27-02 mitigation).
- Existing test_agent_files.py fixture now seeds the LIVE sentinel
(Phase 24 D-11 invariant; pre-Phase-27 fixtures pre-date the
agent-registration side effect, so we add it here to keep the Phase 25/26
contract behaviorally unchanged).
- Import phaze.routers.agent_scan_batches in main.py (alphabetic order between agent_proposals and agent_tracklists).
- app.include_router(agent_scan_batches.router) added in the Phase 26 internal-agent block; Plan 06 will land pipeline_scans.router after.
- New non-async test test_router_registered_in_main_app asserts the path prefix /api/internal/agent/scan-batches is reachable on the production create_app() app (NOT just the smoke-app fixture) and that a PATCH method is bound there. This closes the wiring acceptance gap.
Records the Wave 2 controller landing: PATCH /scan-batches/{batch_id} +
batch_id resolution on POST /files + PhazeAgentClient.patch_scan_batch +
main.py wiring. 991 tests passing, no regression, 17 new tests, 3 atomic
commits, 0 non-trivial deviations from the agent_proposals.py mirror.
…(D-11..D-13)
Walks scan_path on the agent host, SHA-256s each known-extension file via
asyncio.to_thread, POSTs FileUpsertChunk(batch_id=...) of 500 records via
ctx['api_client'].upsert_files, and PATCHes the ScanBatch's processed_files
after each chunk. Terminal PATCH carries status='completed' + total_files=N
on a clean walk, or status='failed' + error_message on a missing scan_path
or AgentApiServerError after retries.
Mitigations encoded:
- Pitfall 3 (NFC drift): unicodedata.normalize("NFC", ...) applied to all
three path fields (original_path, original_filename, current_path).
- Pitfall 4 (symlink traversal): os.walk(scan_root, followlinks=False).
- D-12 mid-walk OSError: per-file try/except logs a warning and continues.
- AUTH-01: scan_directory NEVER stamps agent_id or id -- the controller
resolves both from the bearer token.
- D-13 + Phase 26 D-25: NO imports of phaze.database, phaze.models,
phaze.services.ingestion, or sqlalchemy. Helper _classify duplicates
the EXTENSION_MAP lookup logic so we avoid importing services.ingestion
(which transitively imports phaze.models).
11 unit tests cover: extension filter, exact 500/500/1 chunking at 1001
files, monotonic per-chunk PATCH counts, terminal completed PATCH,
terminal failed PATCH on missing path, OSError skip, NFC normalization,
agent_id/id omission, batch_id propagation on every chunk, symlink
non-traversal, extra-kwargs ValidationError.
12th test (registration) is for Task 2 and is intentionally failing here.
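The walk, filter, normalize, and chunk steps described above can be sketched as follows. `KNOWN_EXTENSIONS` is an illustrative subset rather than the project's EXTENSION_MAP, and the helper names are hypothetical; hashing and the HTTP calls are left out.

```python
import os
import unicodedata
from itertools import islice

# Illustrative subset only; the project derives this from EXTENSION_MAP.
KNOWN_EXTENSIONS = {".mp3", ".flac", ".mp4"}


def iter_media_paths(scan_root):
    """Yield NFC-normalized paths of known-extension files under scan_root."""
    # followlinks=False keeps the walk from escaping through symlinked dirs.
    for dirpath, _dirnames, filenames in os.walk(scan_root, followlinks=False):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in KNOWN_EXTENSIONS:
                yield unicodedata.normalize("NFC", os.path.join(dirpath, name))


def chunked(items, size):
    """Yield lists of at most `size` items, preserving order."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk
```

With 1001 items and a chunk size of 500 this produces chunks of 500, 500, and 1, matching the chunking case in the unit tests listed above.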
Adds scan_directory to the SAQ worker's functions list so AgentTaskRouter can enqueue it by name on the per-agent queue (Phase 27 D-13). Import ordering follows alphabetic (scan_directory before scan_live_set). Placed between scan_live_set and execute_approved_batch per 27-PATTERNS.md line 642 -- keeps the scan-family tasks contiguous. The Phase 26 D-25 import-boundary invariant (no phaze.database / phaze.models / sqlalchemy in agent_worker's transitive import graph) is preserved: scan.py's new scan_directory uses only phaze.config, phaze.constants, phaze.schemas.*, phaze.services.hashing, and phaze.services.agent_client -- all Postgres-free. Verified by tests/test_task_split.py::test_agent_worker_does_not_import_phaze_database. The previously-deselected registration test (tests/test_tasks/test_scan_directory.py::test_scan_directory_registered_in_agent_worker_settings) now passes -- closes the 12th test in that file.
Wave 3 first task: implement the three asyncio-side primitives for the
always-on watcher.
- Debouncer: dict[str, _PendingEntry] state machine driven by
time.monotonic(); touch() inserts/refreshes; sweep() returns ready
paths after settle_period and evicts stuck paths after max_pending
(D-02 cap, T-27-05 mitigation). Snapshot iteration via list(...) is
the Pitfall-2 safe-mutation pattern.
- WatcherEventHandler: watchdog -> asyncio bridge. Subscribes to
FileCreatedEvent + FileModifiedEvent only (D-01); filters by
EXTENSION_MAP for MUSIC/VIDEO categories (SCAN-03); NFC-normalizes
src_path (Pitfall 3); dispatches via loop.call_soon_threadsafe -- the
only sanctioned cross-thread primitive (Pitfall 2). Accepts both str
and bytes src_paths from watchdog with a graceful drop on undecodable
byte sequences.
- Poster: chunk-of-1 POST adapter. Stats + SHA-256 off-loop via
asyncio.to_thread; vanished-path OSError dropped at DEBUG (Pitfall 1);
FileUpsertChunk(files=[record]) omits batch_id so the controller
resolves the LIVE sentinel from the bearer token (D-18); all three
path fields NFC-normalized; AgentApi{Client,Server,}Error all logged
via logger.exception (never re-raised) so a single record failure
cannot crash the sweep loop.
10 unit tests green (5 debouncer + 5 observer). Thread-bridge invariant
verified: test_event_handler_uses_call_soon_threadsafe asserts touch()
is NEVER invoked directly on the test thread.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
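A debouncer along the lines described in this commit could be sketched as below. This is an assumption-laden illustration: the injected `clock` parameter and the `(ready, evicted)` return shape are choices made for the example, and the defaults reuse the settle and max-pending values from the AgentSettings commit.

```python
import time
from dataclasses import dataclass


@dataclass
class _PendingEntry:
    first_seen: float
    last_touched: float


class Debouncer:
    """Hold paths until they stop changing, evicting ones that never settle."""

    def __init__(self, settle_seconds=10.0, max_pending_seconds=3600.0,
                 clock=time.monotonic):
        self._settle = settle_seconds
        self._max_pending = max_pending_seconds
        self._clock = clock
        self._pending: dict[str, _PendingEntry] = {}

    def touch(self, path):
        """Insert a new path or refresh the last-touched time of a known one."""
        now = self._clock()
        entry = self._pending.get(path)
        if entry is None:
            self._pending[path] = _PendingEntry(now, now)
        else:
            entry.last_touched = now

    def sweep(self):
        """Return (ready, evicted) and drop both groups from the pending set."""
        now = self._clock()
        ready, evicted = [], []
        # Iterate over a snapshot so deleting entries is safe mid-loop.
        for path, entry in list(self._pending.items()):
            if now - entry.last_touched >= self._settle:
                ready.append(path)
                del self._pending[path]
            elif now - entry.first_seen >= self._max_pending:
                evicted.append(path)
                del self._pending[path]
        return ready, evicted
```

Injecting the clock keeps tests deterministic: a fake clock can advance past the settle period without sleeping.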
SAQ 0.26.3's Worker.__init__ does not accept timeout, retries, or keep_result -- they are per-Job settings. Passing them through the settings dict broke `saq phaze.tasks.controller.settings` (and the agent_worker equivalent) on a fresh docker compose stack with TypeError.
Drop the three keys from both settings dicts; preserve the project's policy defaults (600s timeout, 4 retries, 3600s ttl) via a Queue-level before_enqueue hook in phaze.tasks._shared.queue_defaults that applies them only when the Job is still at its SAQ default (preserving caller-supplied per-job overrides).
Regression tests:
- test_before_enqueue_applies_project_defaults
- test_before_enqueue_preserves_explicit_overrides
- test_controller_settings_construct_real_worker (would have caught the original TypeError -- now passes)
- test_agent_worker_settings_construct_real_worker (same, for the agent-side dict)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
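The "apply only when still at the library default" rule can be sketched without SAQ itself. The `Job` dataclass below is a stand-in, and the values in `LIB_DEFAULTS` are assumptions about the library's defaults that should be checked against the installed SAQ version; the hook is written as a coroutine on the assumption that the queue awaits its before_enqueue callbacks.

```python
import asyncio
from dataclasses import dataclass

# Assumed library defaults for the stand-in Job; verify against the real SAQ.
LIB_DEFAULTS = {"timeout": 10, "retries": 1, "ttl": 600}
# Project policy defaults quoted in the commit message.
PROJECT_DEFAULTS = {"timeout": 600, "retries": 4, "ttl": 3600}


@dataclass
class Job:
    """Minimal stand-in for a queue job (hypothetical, not saq.Job)."""

    function: str
    timeout: int = LIB_DEFAULTS["timeout"]
    retries: int = LIB_DEFAULTS["retries"]
    ttl: int = LIB_DEFAULTS["ttl"]


async def apply_project_defaults(job: Job) -> None:
    """Overwrite a field only if the caller left it at the library default."""
    for field, lib_default in LIB_DEFAULTS.items():
        if getattr(job, field) == lib_default:
            setattr(job, field, PROJECT_DEFAULTS[field])
```

One limitation of this approach: a caller who explicitly passes a value equal to the library default is indistinguishable from one who passed nothing, so that value is overwritten too.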
…artup
GAP-2 (migrations): the api lifespan only opened the engine for a SELECT 1 connectivity check -- it never ran `alembic upgrade head`. On a fresh docker compose stack the agents/files tables did not exist and every request 500'd. Wire `phaze.database.run_migrations` into the lifespan BEFORE the engine SELECT 1 so the schema is at head before any router runs. Idempotent + gated by the new `settings.auto_migrate` knob (env: PHAZE_AUTO_MIGRATE, default true).
GAP-3 (seed dev agent): migration 012 seeds the legacy agent ONLY when backfilling a populated v3.0 files table. On a fresh DB no agent exists, so the watcher's /whoami returns 403 and the container restart-loops. Add `phaze.services.agent_bootstrap.ensure_dev_agent` -- on an empty agents table it seeds a single `dev-agent` row with a sha256'd bearer (either operator-supplied via PHAZE_DEV_AGENT_TOKEN or freshly random). The cleartext bearer is logged once at INFO so the operator can copy it into the watcher's .env. Gated by `settings.dev_seed_agent` (env: PHAZE_DEV_SEED_AGENT, default false).
Regression tests:
- test_run_migrations_invokes_alembic_upgrade_head
- test_run_migrations_is_idempotent
- test_run_migrations_skips_when_auto_migrate_false
- test_api_lifespan_runs_migrations_on_startup (verifies the call-order invariant: run_migrations BEFORE engine.begin BEFORE ensure_dev_agent)
- test_ensure_dev_agent_seeds_when_table_empty
- test_ensure_dev_agent_noop_when_agent_exists
- test_ensure_dev_agent_uses_env_token_when_set
- test_ensure_dev_agent_disabled_in_prod
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
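The token handling in GAP-3 (operator-supplied or random bearer, stored only as a SHA-256 digest) can be sketched as below. The function name, the hex encoding of the digest, and the 32-byte random length are assumptions for illustration; the source only states that the bearer is sha256'd.

```python
import hashlib
import secrets


def resolve_dev_token(environ):
    """Return (cleartext_token, sha256_hex) for the dev agent.

    Uses PHAZE_DEV_AGENT_TOKEN when set and non-empty, otherwise a random token.
    """
    token = environ.get("PHAZE_DEV_AGENT_TOKEN") or secrets.token_urlsafe(32)
    # Only the digest would be persisted; the cleartext is shown to the operator once.
    token_hash = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, token_hash
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test.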
.env.example previously documented the four optional PHAZE_WATCHER_* tunables but NOT the three required agent-mode vars (PHAZE_AGENT_API_URL, PHAZE_AGENT_TOKEN, PHAZE_AGENT_SCAN_ROOTS) nor the host-vs-container hostname distinction (postgres/redis service DNS when in docker compose vs localhost when running on host via uv).
Add explicit sections for:
- Host vs Container hostname rule (callout at the top)
- Gap 2/3 bring-up knobs (PHAZE_AUTO_MIGRATE, PHAZE_DEV_SEED_AGENT, PHAZE_DEV_AGENT_TOKEN)
- Required agent-mode env vars with example values and operator notes
Regression tests:
- test_env_example_documents_all_required_agent_mode_vars
- test_env_example_documents_auto_migrate_and_dev_seed
- test_env_example_explains_host_vs_container
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the watcher died with a raw pydantic ValidationError stack
trace when PHAZE_AGENT_API_URL (or another required AgentSettings field)
was missing. The operator-facing Pitfall-7 hint
("auth invalid; check PHAZE_AGENT_TOKEN") emitted by whoami_with_retry
was never reached because the validator tripped first.
Wrap the get_settings() call in main() with try/except ValidationError.
On failure, emit one ERROR log per failed field (with the field name and
its mapped env-var name like PHAZE_AGENT_API_URL), log the full pydantic
exception at DEBUG for troubleshooting, then sys.exit(1) so docker
compose restart-cycles with a meaningful logline.
Regression test:
- test_main_logs_actionable_error_on_missing_env: monkeypatches env to
remove PHAZE_AGENT_API_URL, asserts the ERROR log mentions the var by
name AND uses the "missing"/"required" keywords AND exit code is 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The agent_watcher README documented env vars but lacked a sequenced bring-up walkthrough. Operators bringing up a fresh docker compose stack had to puzzle out the order of api startup + dev-agent seeding + token copy + watcher startup themselves.
Add a "Fresh Install Quickstart" section that walks through the entire flow end-to-end:
- copy .env.example, host-vs-container hostname rule
- enable PHAZE_DEV_SEED_AGENT, pick a token
- bring up postgres + redis, then api + worker (migrations + seeding happen automatically in the api lifespan)
- bring up the watcher and verify with `docker logs watcher`
- production checklist for disabling the dev-seed path
Docs only; no test required.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After Phase 27 UAT Gap 2 / Gap 3 wired `run_migrations` and `ensure_dev_agent` into the api lifespan, the pre-existing Phase 4 gap tests (test_lifespan_creates_queue_on_startup and test_lifespan_disconnects_queue_on_shutdown) started failing because the lifespan now opens a real DB connection before reaching the Queue/engine mocks. Patch the new entry points (run_migrations, ensure_dev_agent, async_session) so these tests stay unit-level. No behavioural change -- only test plumbing alignment with the new lifespan order documented in test_main_lifespan.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the 6 commits, what each fixed, the regression test that would have caught the original bug, and the auxiliary lifespan-test fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The initial Gap 3 fix used `count(*) > 0` to detect an "already populated" agents table, but Migration 012 inserts a `legacy-application-server` row with `revoked_at=NOW()` and `token_hash=NULL` as a marker. That row cannot authenticate, so on a fresh DB the watcher still has no usable agent, but the naive count check would no-op.
Refine the check to count USABLE agents (`revoked_at IS NULL AND token_hash IS NOT NULL`). Production migrations from v3.0 data leave the legacy row as a revoked marker; the dev-seeder now correctly seeds past it.
Test that would have caught this: the new `test_ensure_dev_agent_seeds_past_revoked_legacy_marker` deletes the tokenless conftest legacy, inserts a production-shaped revoked legacy, then asserts ensure_dev_agent still seeds a usable dev-agent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
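The difference between the naive count and the usable-agent count can be shown with an in-memory SQLite table. This is a sketch under stated assumptions: the three-column schema, the helper names, and the timestamp value are illustrative, and the real code runs against Postgres through SQLAlchemy.

```python
import sqlite3

USABLE_AGENTS_SQL = (
    "SELECT COUNT(*) FROM agents "
    "WHERE revoked_at IS NULL AND token_hash IS NOT NULL"
)


def needs_dev_agent(conn: sqlite3.Connection) -> bool:
    """True when no agent row could actually authenticate."""
    (usable,) = conn.execute(USABLE_AGENTS_SQL).fetchone()
    return usable == 0


def demo_connection() -> sqlite3.Connection:
    """Build a table holding only the revoked, tokenless legacy marker row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE agents (id TEXT PRIMARY KEY, token_hash TEXT, revoked_at TEXT)"
    )
    # Illustrative timestamp; the migration uses NOW().
    conn.execute(
        "INSERT INTO agents VALUES (?, NULL, ?)",
        ("legacy-application-server", "2026-01-01T00:00:00Z"),
    )
    return conn
```

On the demo table a plain `COUNT(*)` is 1, so the naive check would skip seeding, while `needs_dev_agent` still returns True because the only row is revoked and has no token hash.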
Phase 27's watcher runs via `asyncio.run(main())` and never goes through uvicorn's logging configuration. Without an explicit handler, EVERY logger.info/error/etc call in the watcher (startup banner, sweep warnings, post failures, evictions) was swallowed: operators saw an empty `docker logs phaze-watcher-1` even when the process was healthy and posting files. A healthy watcher was indistinguishable from a hung one.
Add `_configure_logging()` at the top of main() that attaches a single stdout StreamHandler to the root logger and sets root level to INFO. Idempotent: re-running adds no duplicate handler.
Test that would have caught this: `test_configure_logging_attaches_stdout_handler` resets root handlers, invokes the function, asserts exactly one stdout StreamHandler is present and root level <= INFO. Also asserts idempotency via a second invocation.
Surfaced during Phase 27 UAT live bringup: the watcher container was "Up 38 seconds" with zero log lines, leaving us unable to tell whether it was working or stuck. This is the seventh gap closed in the UAT loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
watchdog's native Observer relies on inotify on Linux, but macOS docker bind mounts (rancher-desktop / Docker Desktop) do NOT propagate inotify events through 9p/virtiofs. The watcher's Observer schedules without error but never fires: files are visible inside the container but no events reach the WatcherEventHandler.
Add an opt-in `PHAZE_WATCHER_POLLING_MODE` config that swaps the native Observer for watchdog's PollingObserver. Native remains the default so production Linux deployments keep their efficient inotify backend; macOS devs running UAT via docker compose set the env var to work around the bind-mount limitation.
Tests that would have caught the wiring bug:
- `test_main_uses_polling_observer_when_flag_set` asserts PollingObserver is constructed and native Observer is NOT touched when the flag is true.
- `test_main_uses_native_observer_by_default` asserts the default path uses the native Observer (no Polling).
.env.example documents the new knob with the macOS context.
Surfaced during Phase 27 UAT; eighth gap in the live-bringup loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migration 012 explicitly seeds a LIVE-sentinel ScanBatch for the `legacy-application-server` agent so POST /api/internal/agent/files can resolve `batch_id=None` via the `uq_scan_batches_agent_id_live` partial unique index. The dev-seeder created the agent but skipped this step, so the watcher's chunk-of-1 upserts hit `scalar_one()` and crashed with `sqlalchemy.exc.NoResultFound: No row was found when one was required`. Add a `ScanBatch(agent_id=dev-agent, scan_path='<watcher>', status='live')` insert immediately after the Agent insert. Test that would have caught this: extended `test_ensure_dev_agent_seeds_when_table_empty` now asserts the LIVE sentinel ScanBatch exists with the canonical `<watcher>` scan_path marker after seeding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In docker-compose mode SCAN_PATH is the HOST filesystem path used as the bind-mount source (e.g. /Users/Robert/phaze-watch-test), while PHAZE_AGENT_SCAN_ROOTS is the IN-CONTAINER path the agent's watcher walks (e.g. /data/music). The original seeder copied settings.scan_path into the dev-agent's scan_roots column, which wrote the host path; the watcher then tried to schedule a watchdog Observer on the host path from inside the container and crashed with FileNotFoundError.
Read PHAZE_AGENT_SCAN_ROOTS directly from os.environ (comma-split, matching AgentSettings._split_scan_roots). Fall back to settings.scan_path only when the agent env var is unset.
Test that would have caught this: test_ensure_dev_agent_uses_phaze_agent_scan_roots_env_when_set sets both vars to different values and asserts the agent row gets the PHAZE_AGENT_SCAN_ROOTS value.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
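The env-first resolution can be sketched in a few lines. The function name is hypothetical, and treating a blank value the same as an unset variable is an assumption made for the example; the source only specifies the unset case.

```python
def resolve_scan_roots(environ, fallback_scan_path):
    """Prefer PHAZE_AGENT_SCAN_ROOTS (comma-separated); else use the fallback."""
    raw = environ.get("PHAZE_AGENT_SCAN_ROOTS", "")
    roots = [part.strip() for part in raw.split(",") if part.strip()]
    # Assumption: an empty or whitespace-only value counts as unset.
    return roots or [fallback_scan_path]
```

With `PHAZE_AGENT_SCAN_ROOTS=/data/music` and a host-side fallback of `/Users/Robert/phaze-watch-test`, the in-container path wins, which is the behavior the regression test asserts.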
Two related fixes:
1. **Tailwind SRI mismatch (Gap 11):** base.html pinned the Tailwind CDN URL to @4 (major-version-only). jsdelivr silently ships newer 4.x point releases under that URL, and the previously-pinned SRI hash stops matching. Browsers BLOCK script execution on SRI mismatch, so Tailwind never loads and the entire admin UI renders unstyled. Pin to @4.3.0 with a matching SRI computed against the current served content.
2. **Test env isolation:** the project's docker-compose .env now defines runtime overrides like PHAZE_WATCHER_POLLING_MODE=true and PHAZE_WATCHER_SETTLE_SECONDS=3. pydantic-settings reads .env files into every BaseSettings() construction, which silently changed which code path tests exercised. Add an autouse conftest fixture that points BaseSettings classes at env_file=None for the test session and delenv's known toggle vars so neither .env nor a developer's shell env can leak in.
Tests added (would have caught Gap 11):
- test_every_cdn_script_pins_a_specific_version: static check that SRI-protected URLs in base.html aren't unpinned (e.g. @4 alone).
- test_cdn_sri_hashes_match_served_content: network-using check that every pinned SRI hash matches what the CDN currently serves.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
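Both checks from the Gap 11 fix can be approximated offline. This is a sketch: the helper names are hypothetical, SHA-384 is assumed as the digest (the commit does not say which algorithm base.html uses), the regex only recognizes a full `major.minor.patch` pin, and the URLs in the usage note are made up for illustration.

```python
import base64
import hashlib
import re


def sri_sha384(content: bytes) -> str:
    """Compute a Subresource Integrity value for the given script bytes."""
    digest = hashlib.sha384(content).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


# A pinned URL carries a full major.minor.patch version after "@".
_PINNED_VERSION = re.compile(r"@\d+\.\d+\.\d+")


def is_version_pinned(url: str) -> bool:
    """True when the CDN URL names an exact version rather than a range."""
    return _PINNED_VERSION.search(url) is not None
```

For example, `is_version_pinned("https://cdn.example.com/npm/some-pkg@4")` is False while the same URL ending in `@4.3.0` is True. A network-using test would fetch the pinned URL and compare `sri_sha384(body)` with the `integrity` attribute.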
Alpine v3 does NOT process `:class` on the <html> element unless <html>
carries an x-data directive (Alpine's scanner starts at <body>). The
previous binding `<html :class="$store.theme.dark ? 'dark' : ''">` was
silently inert: clicking the toggle button mutated `$store.theme.mode`
but the <html> .dark class was never added or removed afterward, so the
visual theme was permanently stuck at whatever the pre-flash script
chose on initial load.
Fix:
- Drop the inert `:class` binding from <html>.
- Move dark-class application into a single function `_applyTheme(mode)`
that flips `document.documentElement.classList.toggle('dark', ...)`
directly.
- Call it from three places: the pre-flash IIFE (first paint), the
Alpine store's `set()` method (toggle click), and a
`prefers-color-scheme` media query change listener (OS-level switch
while in `auto` mode).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Alembic-migrated postgres schema declares scan_batches.created_at as TIMESTAMP WITH TIME ZONE, so asyncpg materializes it as a tz-aware datetime. `_elapsed_seconds` did `datetime.now(UTC).replace(tzinfo=None) - batch.created_at` which crashed with `TypeError: can't subtract offset-naive and offset-aware datetimes`. The scan_progress endpoint returned 500 and the admin UI's polling card went blank during UAT Test 2.
Surfaced because the test suite hides the divergence: SQLAlchemy's create_all generates TIMESTAMP WITHOUT TIME ZONE columns, so loaded ScanBatch rows in tests were tz-naive and the subtraction worked. Production schema differs from test schema.
Fix: compare aware-to-aware. If `created_at` is unexpectedly tz-naive (test fixtures that bypass the DB), treat it as UTC so the helper returns a meaningful value either way.
Tests that would have caught this (regardless of test/prod schema divergence):
- `test_elapsed_seconds_handles_tz_aware_created_at` constructs an aware datetime in Python and calls the helper directly.
- `test_elapsed_seconds_handles_tz_naive_created_at_as_utc` covers the defensive fallback path so test-fixture loaders keep working.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The user-initiated scan flow enqueues scan_directory + extract_file_metadata
onto the per-agent SAQ queue `phaze-agent-<agent_id>`, but Phase 27's
docker-compose.yml shipped only:
- `worker`: controller queue (PHAZE_ROLE=control)
- `watcher`: filesystem observer (no SAQ consumer)
so jobs sat in Redis with status="queued" forever. The UI's scan_progress
card polled correctly (gap-12 ✓) but `total_files` stayed 0 and the card
never transitioned to COMPLETED -- breaking 27-UAT Test 2's "terminal halt"
contract.
Phase 26 D-04's comment scheduled the agent-side worker for Phase 29's
docker-compose.agent.yml overlay, but Phase 27 UAT requires it today.
Fix:
- New `agent-worker` service in docker-compose.yml running
`uv run saq phaze.tasks.agent_worker.settings` with PHAZE_ROLE=agent.
Binds to PHAZE_AGENT_QUEUE=phaze-agent-dev-agent (the dev seeder's
agent_id). Will be parameterized in Phase 29.
- Defer essentia import in `phaze.tasks.functions`: move
`from phaze.services.analysis import analyze_file` into a function-scoped
loader. essentia-tensorflow is gated out of linux-arm64 by pyproject.toml's
environment markers; the eager import made agent_worker's module load fail
on Apple Silicon even though scan_directory / extract_file_metadata never
touch essentia. process_file behavior on x86_64 is unchanged -- the loader
is called at runtime when CPU-bound analysis is dispatched to the process
pool.
Regression test (`tests/test_phase04_gaps.py`):
- `test_docker_compose_has_agent_worker_consuming_agent_queue` parses
docker-compose.yml and asserts at least one service runs
`saq phaze.tasks.agent_worker.settings` with PHAZE_ROLE=agent.
Live-verified on rancher-desktop / linux-arm64: scan_directory(batch_id=2b7e...)
went from status="queued" to status="complete" within seconds of the new
service coming up; GET /pipeline/scans/{id} now returns the COMPLETED
partial with no hx-trigger and no hx-get (polling halts as designed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g of gap-12)
Gap-12 patched `pipeline_scans._elapsed_seconds` to compare aware-to-aware
when computing elapsed scan time, but `pipeline.dashboard` (the Recent
Scans table loader) carried an inline duplicate of the pre-gap-12
antipattern:
now = datetime.now(UTC).replace(tzinfo=None)
batch._elapsed_seconds = int((now - batch.created_at).total_seconds())
The duplicate did not surface during gap-12 because the dashboard had no
non-LIVE ScanBatch rows to walk. Once gap-13 brought up the agent-worker
and the first user-initiated scan completed, the Recent Scans loop hit a
real tz-aware `created_at` from postgres and the entire dashboard route
500'd:
TypeError: can't subtract offset-naive and offset-aware datetimes
at src/phaze/routers/pipeline.py:157
User saw the page as "inaccessible" with an empty Recent Scans table.
Fix:
- Promote `_elapsed_seconds` → `elapsed_seconds` in pipeline_scans.py
(now a public shared helper). One definition, one tested behavior.
- `pipeline.dashboard` imports and calls `elapsed_seconds` instead of
re-implementing the math inline.
- Drop the now-unused `datetime` / `UTC` imports in pipeline.py.
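The failure and the aware-to-aware fix can be reproduced in isolation. The sketch below is not the project's helper: the signature of `elapsed_seconds`, the optional `now` parameter, and the treat-naive-as-UTC fallback are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def elapsed_seconds(created_at: datetime, now: Optional[datetime] = None) -> int:
    """Whole seconds between `created_at` and `now`, compared aware-to-aware."""
    if created_at.tzinfo is None:
        # Assumption for this sketch: naive inputs are interpreted as UTC.
        created_at = created_at.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return int((current - created_at).total_seconds())


# Reproduce the antipattern: a naive "now" minus a tz-aware created_at.
created = datetime(2026, 5, 14, 12, 0, 0, tzinfo=timezone.utc)
naive_now = datetime(2026, 5, 14, 12, 0, 30)
try:
    naive_now - created
    error_message = ""
except TypeError as exc:
    error_message = str(exc)
```

Keeping both operands timezone-aware avoids the `TypeError` regardless of whether the database driver returns aware values.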
Regression test (`tests/test_routers/test_pipeline_scans.py`):
- `test_no_router_uses_tz_naive_now_antipattern` walks the router package
AST and fails on any `datetime.now(...).replace(tzinfo=None)` pattern.
Catches gap-12, gap-14, and any future sibling instance in one rule.
- Existing `_elapsed_seconds` tests updated to import `elapsed_seconds`
(public rename).
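An AST walk of the kind described for `test_no_router_uses_tz_naive_now_antipattern` might look like the sketch below. The function name `find_tz_naive_now` is hypothetical and the matching rule is an assumption: it flags any `<expr>.now(...).replace(..., tzinfo=None)` call, which may be broader or narrower than the real test.

```python
import ast


def find_tz_naive_now(source: str) -> list[int]:
    """Return line numbers of `<x>.now(...).replace(tzinfo=None)` calls in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Outer call must be `.replace(...)`.
        if not (isinstance(func, ast.Attribute) and func.attr == "replace"):
            continue
        inner = func.value
        # The receiver must itself be a `.now(...)` call.
        if not (
            isinstance(inner, ast.Call)
            and isinstance(inner.func, ast.Attribute)
            and inner.func.attr == "now"
        ):
            continue
        # Flag only when `tzinfo=None` is passed as a keyword.
        for kw in node.keywords:
            if (
                kw.arg == "tzinfo"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is None
            ):
                hits.append(node.lineno)
    return hits
```

A test could read every `.py` file in the router package, call this on its text, and fail if any list is non-empty.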
Live-verified: GET /pipeline/ now returns 200 with the Recent Scans table
populated (1 row -- the completed dev-agent /data/music scan with the
green COMPLETED pill).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing-up)
All three Phase 27 UAT tests are PASS:
1. End-to-end file drop → FileRecord under LIVE batch (9 gaps closed)
2. Admin UI scan trigger → progress polling → terminal halt (1 gap closed: gap-13)
3. Visual layout verification of admin UI (1 gap closed: gap-14)
The remaining 3 gaps (10-12) landed between Test 1 and Test 2 and closed
prerequisites for the polling-card behavior.
Full gap inventory:
gap-1    SAQ Worker kwargs (Phase 26 spillover)
gap-2/3  Auto-migrate + ensure_dev_agent at api startup
gap-4    .env.example required vars + host/container guidance
gap-5    Surface readable error on missing watcher env
gap-6    Watcher fresh-install quickstart
gap-7    Watcher stdout logger
gap-8    PollingObserver for macOS bind mounts
gap-9    Seed LIVE-sentinel ScanBatch alongside dev-agent
gap-10   Dev-seeder prefers PHAZE_AGENT_SCAN_ROOTS
gap-11   Tailwind SRI mismatch + test env isolation
gap-12   scan_progress 500 on tz-aware created_at
gap-13   docker-compose missing agent-worker service
gap-14   Dashboard 500 on tz-aware created_at (sibling of gap-12)
Each gap was committed atomically with a regression test that would have
caught the original bug. UAT status flipped from `testing` to `complete`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 27's verifier left status=human_needed because three of its checks
require a browser + live docker stack + real-time settle timer. All three
have now been performed and passed against the docker-compose bring-up on
rancher-desktop / linux-arm64 (see 27-HUMAN-UAT.md):
1. End-to-end file drop → FileRecord under LIVE batch (PASS, 9 gaps closed)
2. Admin UI scan trigger → polling → terminal halt (PASS, gap-13 closed)
3. Visual layout verification (PASS, gap-14 closed)
Promotes status human_needed → pass. Unlocks /gsd-ship 27.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
threats_open: 0 -- all 24 plan-time threats verified CLOSED or accepted.
gsd-security-auditor audited the post-UAT state of the branch (after the
14 `fix(27-uat-gaps):` commits) to confirm plan-time mitigations survived
the UAT churn unchanged. One mitigation was hardened during UAT (T-27-03
substring `if ".." in joined` upgraded to component-level
`PurePosixPath.parts` check per WR-01); all others verified by grep gates
+ live test invocation against the current tree.
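The difference between the substring check and the component-level check on T-27-03 can be seen in a few lines. The helper names below are hypothetical, and the sketch only illustrates the `PurePosixPath.parts` technique named above, not the actual router code.

```python
from pathlib import PurePosixPath


def has_dotdot_substring(joined: str) -> bool:
    """Original style: flags any occurrence of two consecutive dots."""
    return ".." in joined


def has_parent_component(joined: str) -> bool:
    """Component-level style: flags only a path component that is exactly `..`."""
    return ".." in PurePosixPath(joined).parts
```

`PurePosixPath` does not collapse `..` components, so `/data/music/../etc` still yields a `..` part and is flagged, while a file name that merely contains two dots (for example `live..2024`) is flagged by the substring check but not by the component check.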
Accepted risks documented:
AR-27-01 T-27-07 CSRF deferred to Phase 29 (private-LAN single-operator)
AR-27-02 Concurrent overlapping scans (idempotent UQ absorbs)
AR-27-03 Watcher catch-up on startup (PROJECT.md scope lock for v4.0)
AR-27-04 Dev-seed bearer cleartext in API logs (gap-3; intentional dev
path, gated on empty agents table + dev_seed_agent=True; never
triggers in production)
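AR-27-02 leans on an idempotent unique-constraint upsert to absorb overlapping scans. The sketch below shows that idea with the standard-library `sqlite3` module. It is an illustration under assumptions: the project uses Postgres, and the table name `files`, the `size` column, and both function names are hypothetical; only the `(agent_id, original_path)` key pair appears in the PR.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with a unique key on (agent_id, original_path)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files ("
        " agent_id TEXT NOT NULL,"
        " original_path TEXT NOT NULL,"
        " size INTEGER NOT NULL,"
        " UNIQUE (agent_id, original_path))"
    )
    return conn


def upsert_file(conn: sqlite3.Connection, agent_id: str, original_path: str, size: int) -> None:
    """Insert a row, or update it in place when the unique key already exists."""
    conn.execute(
        "INSERT INTO files (agent_id, original_path, size) VALUES (?, ?, ?) "
        "ON CONFLICT (agent_id, original_path) DO UPDATE SET size = excluded.size",
        (agent_id, original_path, size),
    )
```

Two scans that report the same path for the same agent end up writing one row, so the second write is harmless rather than a duplicate.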
Closes the security gate ahead of /gsd-ship 27.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
STATE.md updated to mark Phase 27 as shipped. Milestone v4.0 progress moves from 67% (4/6 phases, 26/33 plans) to 83% (5/6, 33/33 plans). Two phases remain in v4.0 -- Phase 28 (Distributed Execution Dispatch) and Phase 29 (Operational Hardening per CONTEXT § Deferred Ideas). PR: #59 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes 16/46 Codecov-flagged uncovered lines surfaced on PR #59.
Lines closed:
- agent_watcher/observer.py:64,68-70,90 (5 lines) → 100%
- agent_watcher/poster.py:94-99 (6 lines) → 100%
- routers/agent_scan_batches.py:99 (1 line) → 100%
- tasks/_shared/agent_bootstrap.py:105-107 (3 lines) → 100%
- tasks/scan.py:82 (1 line) → 91.86%
New tests (9):
- test_event_handler_drops_empty_src_path
- test_event_handler_drops_path_when_fsdecode_raises
- test_event_handler_ignores_directories_in_on_modified
- test_post_one_swallows_agent_api_error_branches (parametrized: 4xx/5xx/catch-all)
- test_whoami_with_retry_short_circuits_on_auth_error_in_final_attempt
- test_resolve_chunk_size_falls_back_when_not_agent_settings
- test_defensive_live_409_when_literal_bypassed
Each test pins a defensive branch a future refactor might silently bypass:
the fsdecode failure log, the directory-event guard on on_modified, each of
the three AgentApi* drop paths in Poster, the Pitfall-7 hint surfacing when
a token rotates mid-bootstrap, the ControlSettings fallback for
_resolve_chunk_size, and the defensive LIVE-status 409.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lopes
Closes 12 more of the Codecov-flagged lines from PR #59. Two files formerly
at 90-92% now sit at 100%.
Lines closed:
- routers/pipeline_scans.py:120, 207, 255-260 (5 lines) → 100%
- tasks/scan.py:212-225 (7 lines) → 100%
New tests (5):
- test_scan_directory_aborts_with_failed_patch_on_server_error -- 5xx upsert
  during walk: abort, terminal failed-PATCH succeeds, return shape pinned to
  {status:"failed", reason:"controller_5xx"}.
- test_scan_directory_terminal_failed_patch_also_fails -- same as above, but
  the terminal failed-PATCH ALSO 503s. Verifies the inner-except suppression:
  no second exception escapes, the return envelope still surfaces, and the
  "terminal failed-PATCH also failed" log message fires for triage.
- test_get_scan_progress_unknown_id_returns_404 -- GET
  /pipeline/scans/{unknown_uuid} → 404 "scan batch not found".
- test_post_scans_prefix_mismatch_via_direct_handler_invocation -- defensive
  prefix-check branch (line 207) is structurally unreachable under normal
  inputs because the literal-membership check dominates and well-formed
  joined paths always prefix-match. Monkeypatches unicodedata.normalize to
  rewrite the joined path out from under the predicate, simulating a
  hypothetical future normalization edge case. Pins the 400 envelope.
- test_post_scans_enqueue_failure_with_secondary_commit_also_failing -- WR-06
  inner-except: when Redis-down causes the enqueue to fail AND a Postgres-down
  kills the secondary commit, the handler MUST still return the 503 envelope
  (no 500 escape). Verifies the inner try/except log, the session.rollback()
  call, and the 503 envelope copy.
Total Codecov gap progress: 28/46 lines closed across 3 commits (batch-A 16,
batch-B 12). Remaining: agent_watcher/__main__.py 18 lines (sweep loop, role
guard, signal fallback, __name__ entry).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
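The two `scan_directory` failure tests in the commit above pin an abort path with an inner-except around the terminal PATCH. A synchronous sketch of that control flow follows. It is not the task's code: `run_scan_sketch`, `ServerError`, the callable parameters, and the success return shape are hypothetical; the failed envelope and the log message text are taken from the commit message.

```python
import logging
from typing import Any, Callable, Iterable

logger = logging.getLogger("scan_sketch")


class ServerError(Exception):
    """Stand-in for a controller 5xx response (hypothetical)."""


def run_scan_sketch(
    chunks: Iterable[Any],
    post_chunk: Callable[[Any], None],
    patch_status: Callable[[str], None],
) -> dict:
    """Post chunks in order; on a server error, abort and try one terminal PATCH."""
    try:
        for chunk in chunks:
            post_chunk(chunk)
    except ServerError:
        try:
            patch_status("failed")
        except Exception:
            # Inner-except suppression: log for triage, never raise a second error.
            logger.exception("terminal failed-PATCH also failed")
        return {"status": "failed", "reason": "controller_5xx"}
    patch_status("completed")
    return {"status": "completed"}  # assumed success shape
```

The caller always receives the same failed envelope whether or not the terminal PATCH went through, which is the behavior the second test verifies.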
Closes the last batch of Codecov gaps from PR #59. agent_watcher/__main__.py
goes from 79.55% to 98.86% -- only line 245 (the `if __name__ == "__main__":`
entry-point bootstrap) is left, and that's intractable to test directly.
Lines closed:
- __main__.py:105-118 sweep_loop full body + inner post-failure path
- __main__.py:114-115 outer-except wrapping sweep iteration failures
- __main__.py:163-164 wrong-role guard (PHAZE_ROLE != agent)
- __main__.py:196-201 NotImplementedError signal-handler fallback
New tests (4):
- test_sweep_loop_posts_ready_logs_evicted_then_exits
- test_sweep_loop_outer_except_swallows_sweep_failure
- test_main_raises_when_settings_is_not_agent_settings
- test_main_swallows_signal_handler_not_implemented
Codecov gap progress (PR #59):
Initial  46 lines uncovered across 7 files (91.69% patch coverage)
Batch-A  16 lines covered (4 files → 100%)
Batch-B  12 lines covered (2 more files → 100%)
Batch-C  17 lines covered (last file 79.55% → 98.86%)
Final    1 line uncovered (the __main__ entrypoint; intractable)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
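The `NotImplementedError` signal-handler fallback covered by the commit above reflects a documented asyncio limitation: `loop.add_signal_handler` is unavailable on some platforms (Windows event loops, for example). A sketch of the guard follows; `install_stop_handlers` and its boolean return value are hypothetical, and what the watcher does after the fallback is not stated in the commit.

```python
import asyncio
import logging
import signal
from typing import Callable

logger = logging.getLogger("watcher_sketch")


def install_stop_handlers(loop: asyncio.AbstractEventLoop, on_stop: Callable[[], None]) -> bool:
    """Register SIGINT/SIGTERM callbacks on the loop; return False if unsupported."""
    try:
        for sig in (signal.SIGINT, signal.SIGTERM):
            loop.add_signal_handler(sig, on_stop)
    except NotImplementedError:
        # Platform has no loop-level signal support: swallow and carry on.
        logger.warning("loop signal handlers unavailable; continuing without them")
        return False
    return True
```

A test can exercise the fallback without a real event loop by passing a stub object whose `add_signal_handler` raises `NotImplementedError`.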
SimplicityGuy added a commit that referenced this pull request on May 14, 2026
Phase 27 (Watcher Service & User-Initiated Scan) merged into main on 2026-05-14 as commit 4efb4a4. Status flips shipped → ready_to_plan so the next phase (Phase 28 -- Distributed Execution Dispatch) can be planned. Milestone v4.0 progress: 5/6 phases (83%), 33/33 plans complete. Phase 28 and Phase 29 remain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 27: Watcher Service & User-Initiated Scan
Goal: Each file server continuously streams new file arrivals to the application server, and the administrator can also trigger an explicit scan of any path on any agent from the admin UI.
Status: Verified ✅ · Threat-secure ✅ · UAT pass (3/3)
Ships the agent-side filesystem watcher and the operator-facing user-initiated scan flow. New `phaze-agent-watcher` (filesystem observer using `watchdog`) and `phaze-agent-worker` (SAQ consumer for the per-agent queue) services. The `/pipeline/` admin UI gains a Trigger Scan card, an HTMX-polling scan-progress card with `every 2s` halt-on-terminal, and a Recent Scans mini-table. End-to-end: drop a file under the watcher's root → settle for 10s → `FileRecord` appears bound to the agent's LIVE sentinel batch; trigger a scan from the UI → API enqueues `scan_directory` → agent walks the tree → chunked POST → batch transitions to COMPLETED → polling halts.
Changes
Plan 27-01: Watcher Foundation (Wave 0)
- `watchdog` dep added; `AgentSettings` gains watcher knobs (settle period, debounce, stuck-file cap, polling-mode flag)
- `phaze.tasks._shared.agent_bootstrap` (whoami-with-retry, construct_agent_client)
- `tests/test_task_split.py` import-boundary tuple (forbids `phaze.tasks.agent_worker` from the watcher graph)
Plan 27-02: Wire Schemas (Wave 1)
- `FileUpsertChunk` gains `batch_id` field
- `ScanBatchPatch` (`Literal["running","completed","failed"]` -- LIVE excluded at the schema layer)
- `ScanBatchResponse`, `ScanDirectoryPayload`, `TriggerScanForm`
- … `extra="forbid"`
Plan 27-03: Controller HTTP API (Wave 2)
- `PATCH /api/internal/agent/scan-batches/{batch_id}` with 403-before-state-machine cross-tenant guard
- `POST /api/internal/agent/files` accepts optional `batch_id` with the same 403-before-records-loop guard
- `PhazeAgentClient.patch_scan_batch` method
Plan 27-04: scan_directory Task (Wave 3)
- `phaze.tasks.scan.scan_directory(scan_path, batch_id)` -- chunked HTTP-only directory walk
- `os.walk(followlinks=False)` per Pitfall 4; per-file OSError skip per D-12; NFC normalization on all path fields per Pitfall 3
- … `agent_worker.settings.functions`
Plan 27-05: Watcher Package (Wave 3)
- `phaze.agent_watcher`: `Debouncer` (3600s stuck-file eviction), `WatcherEventHandler` (cross-thread bridge via `call_soon_threadsafe`), `Poster` (HTTP egress), `__main__`
Plan 27-06: Admin UI (Wave 3)
- `routers/pipeline_scans.py`: POST /pipeline/scans, GET /pipeline/scans/{id} (HTMX poll partial), GET /pipeline/scans/agent-roots (HTMX swap target)
- … (`trigger_scan_card`, `scan_path_picker`, `scan_progress_card`, `recent_scans_table`, `scan_status_pill`, `scan_submit_error`)
- … (`..` rejection + prefix validation against `agent.scan_roots`)
- `dashboard.html` extended with the Trigger Scan card and Recent Scans section
Plan 27-07: Deployment + Docs (Wave 5)
- `docker-compose.yml` `watcher` service (`:ro` mount; `restart: unless-stopped`)
- `.env.example` documents all required agent-mode vars + host/container hostname distinction
- `src/phaze/agent_watcher/README.md`
Requirements Addressed
- `(agent_id, original_path)` ingestion (no duplicates on re-walk)
Verification
- `threats_open: 0` -- 20/24 mitigated, 4/24 accepted (27-SECURITY.md)
UAT Gaps Closed During Live Bring-Up (14)
- `.env.example` missing required agent-mode vars + host/container guidance
- `scan_progress` 500 on tz-aware `created_at` (postgres TIMESTAMPTZ vs test schema divergence)
- docker-compose missing agent-worker service (… `phaze-agent-<agent_id>`)
- Dashboard 500 on tz-aware `created_at` (sibling of gap-12)
Each gap shipped as its own atomic `fix(27-uat-gaps): gap-N` commit with a regression test that would have caught the original bug. Gap-14 also lands an AST-based regression test that forbids the `datetime.now(...).replace(tzinfo=None)` antipattern in any router file.
Key Decisions
- … `docker-compose.agent.yml` in Phase 29; UAT Test 2 surfaced that Phase 27 requires the consumer to demonstrate the user-initiated scan reaching COMPLETED. Added in gap-13.
- `elapsed_seconds` is now a public shared helper: gap-14 promoted it from `_elapsed_seconds` in `pipeline_scans.py` and consolidated the inline duplicate in `pipeline.dashboard`. Backed by an AST-based antipattern test.
- … `accept`; router-layer is `mitigate`: legitimate subpaths need slashes/hyphens, so the regex would be over-restrictive; semantic validation belongs in the controller.
🤖 Generated with Claude Code