-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
WebSocket Infrastructure + Replace /poll with websocket + treat /poll as fallback if ws not available only #863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: development
Are you sure you want to change the base?
WebSocket Infrastructure + Replace /poll with websocket + treat /poll as fallback if ws not available only #863
Conversation
frdel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, as discussed earlier, here are my comments. It works mostly well, we just need to cleanup the code and adjust a bit.
python/api/settings_get.py
Outdated
| async def process(self, input: dict, request: Request) -> dict | Response: | ||
| set = settings.convert_out(settings.get_settings()) | ||
| return {"settings": set} | ||
| runtime_info = {"isDevelopment": runtime.is_development()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not append manually constructed dictionaries to individual APIs and then expose them as global window objects. If we need runtime info to be available globally, we can have dedicated api endpoint or inject it to index.html like we do it with version.
The best approach here would probably be to add it to index.html below globalThis.gitinfo.
If we remove this, we can remove a lot of code from other files as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change good point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adressed, please check
python/helpers/log.py
Outdated
| self.logs: list[LogItem] = [] | ||
| self.set_initial_progress() | ||
|
|
||
| def has_items(self) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thread locking for reads in log seems like a huge overkill that affects the convenience and requires changes in way too many files because of the field methods.
I agree to keep locks when updating, but:
We should consider what problem does this really solve - the only shared vars that can cause trouble are logs[] and updates[], we can use lock on those when updating.
When reading, I don't see a good reason to use locks, because we do multiple reads in sequence anyway, meaning in between another thread can still update the log and we can get different result between these two lines anyway (api_log_get.py):
"progress": context.log.get_progress(),
"progress_active": context.log.get_progress_active(),
Here the lock does not help, we would have to lock the whole sequence of reads.
Considering the worst case scenario here is that for a milisecond there can be something like new status bar with no new message yet or vice versa, I don't think it's worth the added complexity.
Please, keep thread lock only for updates and revert changes in all the files that are just reading the log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will inspect all locks when reading. The only thing we must keep in mind: if for example we iterate with a for loop over an iterable and it get modified (add/remove) mid-flight we would crash. But i will check all occurences and reduce lock contention as far as possible without risking edge-case crashes. I some cases an atomic copy is an alternative if we are no dependent on abolute temporal consistency for some reason
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adressed, please check
python/api/api_log_get.py
Outdated
| try: | ||
| # Get total number of log items | ||
| total_items = len(context.log.logs) | ||
| total_items = context.log.get_total_items() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment in log.py, we should not use read methods, that's an overkill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adressed, please check
python/api/chat_create.py
Outdated
|
|
||
| # New context should appear in other tabs' chat lists via state_push. | ||
| try: | ||
| from python.helpers.state_monitor import get_state_monitor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please either remove the "optional integration" or make a static function for this in some global helper, we call this try-except-else on too many places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adressed, please check
python/api/csrf_token.py
Outdated
| session["csrf_token"] = secrets.token_urlsafe(32) | ||
|
|
||
| # return the csrf token and runtime id | ||
| runtime_info = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not append manually constructed dictionaries to individual APIs and then expose them as global window objects. If we need runtime info to be available globally, we can have dedicated api endpoint or inject it to index.html like we do it with version.
The best approach here would probably be to add it to index.html below globalThis.gitinfo.
If we remove this, we can remove a lot of code from other files as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yepp, will fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adressed, please check
| <template x-if="$store.root?.isDevelopment && $store.websocketEventConsoleStore"> | ||
| <div | ||
| class="ws-console" | ||
| x-init="$store.websocketEventConsoleStore.init()" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Init function in alpine stores is called implicitly when the store is registered. This would call it twice. If you want to call a function every time the component is displayed, use x-create.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adressed, please check
| </head> | ||
| <body> | ||
| <div x-data> | ||
| <template x-if="$store.root?.isDevelopment && $store.websocketEventConsoleStore"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should discard root store in favor of globalThis in index.html as root store does not have it's own script file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adressed, please check
| <button | ||
| class="btn" | ||
| x-show="!$store.websocketEventConsoleStore.captureEnabled" | ||
| @click="$store.websocketEventConsoleStore.startCapture().catch(() => {})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These catches would supress errors, we should let them dump to console as usual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adressed, please check
| _broadcastSeq: 0, | ||
|
|
||
| init() { | ||
| const rootStore = window.Alpine?.store ? window.Alpine.store("root") : undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use window.Alpine, import stores from their scripts instead. But here root store should rather be discarded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adressed, please check
| PrintStyle.info(f"[StateSyncHandler] disconnect sid={sid}") | ||
|
|
||
| async def process_event(self, event_type: str, data: dict, sid: str) -> dict | WebSocketResult | None: | ||
| correlation_id = data.get("correlationId") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting called multiple time in short period with the same state_request type when I switch chats in the UI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The duplicated event should be fixed now.
…onsole **Details:** - Enforced singleton creation for `WebSocketHandler` (new `SingletonInstantiationError`, `get_instance`, reset helpers) and refactored `run_ui.py` plus all handlers/tests to instantiate via the factory. - Instrumented `WebSocketManager` with diagnostic watcher tracking, inbound/outbound/lifecycle summaries, and async broadcasts for `ws_lifecycle_connect|disconnect`; added `ws_dev_console_event` payloads and lifecycle payload metadata. - Extended Developer Settings with `uvicorn_access_logs_enabled` and buttons to open the WebSocket Test Harness and new Event Console modal; persisted defaults in `settings.py`. - Added Event Console frontend (Alpine store + modal) that subscribes/unsubscribes via `ws_event_console_*`, renders diagnostics, filters handled events, and supports reconnect/clear UX. - Updated developer harness handler to handle console subscriptions, plus spec contracts, quickstart, data-model, plan, research, and implementation log to describe singleton rules, diagnostics streams, and lifecycle events; marked Phase N tasks T150–T157 complete. - Documented diagnostics workflow, Event Console behavior, and WebSocket rules in `docs/websocket-infrastructure.md` and `.cursor/rules/agent-zero-global-prompt.mdc`. - Added/adjusted unit tests for singleton enforcement, diagnostics streams, lifecycle broadcasts, and settings UI. **Tests**: `PYTHONPATH=. .venv/bin/python -m pytest tests/test_websocket_handlers.py tests/test_websocket_manager.py`
- Offload handler dispatch + lifecycle hooks via DeferredTask worker; marshal socketio emit/disconnect back to dispatcher loop (FR-DISPATCH-001)
- Fix validate_session to avoid awaiting disconnect while holding sync lock (MIG-010)
- Enforce client -> server {ts,data} payload wrapper; unwrap on server while preserving correlationId (FR-ENVELOPE-UNIFORM)
- Raise ConnectionNotFoundError for unknown sids; update diagnostics watcher emission accordingly (FR-045)
- Remove eager WS auto-connect and dev harness bootstrap import; add proper harness/event console subscribe detach (FR-035, NFR-DIAG-001, FR-DIAG-001)
- Update docs/spec/tasks and tests to match new contracts and behavior
Tests: pytest -q tests/test_websocket_manager.py tests/test_websocket_harness.py
…rver to hypercorn - Replace poll-only UI updates with shared `applySnapshot()` + `buildStateRequestPayload()` so both `/poll` and `state_push` drive the exact same DOM propagation (`webui/index.js`) - Add poll-shaped snapshot builder + schema guard for state sync parity (`python/helpers/snapshot.py`), and refactor `/poll` to use it (`python/api/poll.py`) - Add WebSocket state sync backend: per-SID `StateMonitor` dirty tracking + coalesced `state_push` emission (25ms window) (`python/helpers/state_monitor.py`) plus `StateSyncHandler` `state_request` handshake + gating (`python/websocket_handlers/state_sync_handler.py`) - Make multi-tab list updates reliable by broadcasting dirty for global metadata changes (log/notifications/tasks/chats) while keeping high-frequency log item updates context-scoped (`python/helpers/log.py`, `python/helpers/notification.py`, `python/helpers/task_scheduler.py`, `python/api/chat_create.py`, `python/api/chat_reset.py`, `python/api/chat_remove.py`) - Harden shared-state concurrency: add `threading.RLock` guards and accessor APIs for `AgentContext._contexts`, `Log`, and `NotificationManager`; serialize logs under lock to avoid concurrent mutation (`agent.py`, `python/helpers/log.py`, `python/helpers/notification.py`, `python/helpers/persist_chat.py`, `python/api/api_log_get.py`, `python/api/notifications_*`) - Fix prompt/template parsing so fenced blocks are only stripped for full JSON templates (preserves fenced markdown examples) (`python/helpers/files.py`, `agent.py`) - Improve CSRF + reconnect stability: dedupe `/csrf_token` fetches and add `invalidateCsrfToken()` (`webui/js/api.js`), cache + proactively refresh WS preflight, disable Socket.IO auto-reconnect, and run preflight-gated reconnect with backoff (`webui/js/websocket.js`) - Refresh active WebSocket session CSRF expiry on `/csrf_token` POST via global manager access (`python/api/csrf_token.py`, `python/helpers/websocket_manager.py`) - Replace top-bar connected indicator with 4-state sync status UI and always-mounted sync store (HEALTHY/HANDSHAKE_PENDING/DEGRADED/DISCONNECTED) (`webui/components/sync/sync-store.js`, `webui/components/sync/sync-status.html`, `webui/components/chat/top-section/chat-top.html`) - Restrict polling to degraded fallback only (1Hz idle, short 4Hz bursts on updates), stop polling when disconnected, and trigger `state_request` on context switch (`webui/index.js`, `webui/components/sidebar/chats/chats-store.js`) - Improve restart UX: emit backend "Restarting..." notification before `/restart`, remove `/health` polling loop, and rely on WS reconnect (`webui/components/sidebar/chats/chats-store.js`) - Expand dev tooling: Event Console explicit capture ON/OFF with persisted setting + stable entry keys (`webui/components/settings/developer/websocket-event-console-*`, `docs/websocket-infrastructure.md`); Harness adds state sync/no-poll, context switching, degraded fallback, resync trigger tests with PASS/FAIL logging and safer toast handling (`webui/components/settings/developer/websocket-test-store.js`, `webui/components/settings/developer/websocket-tester.html`, `webui/components/notifications/notification-store.js`) - Add tests for snapshot schema/parity, StateMonitor coalescing + per-SID isolation, StateSyncHandler handshake/gating/initial snapshot, and CSRF token refresh behavior (`tests/test_snapshot_*`, `tests/test_state_*`, `tests/test_multi_tab_isolation.py`, `tests/test_websocket_csrf.py`, `tests/test_websocket_handlers.py`) - Migrate UI server from Uvicorn to Hypercorn (incl. optional `--ssl` + dev CA/cert generation + loopback `/ssl/ca.pem` download) and update deps (`run_ui.py`, `run_ui_ssl.py`, `python/helpers/runtime.py`, `requirements.txt`) - Add missing UI assets referenced in development flows (`webui/vendor/ace-min/ace.min.css`, `webui/public/dev_testing.svg`)
…n handshake - Add optional `mark_dirty` to `activate_project()` / `deactivate_project()` and emit `state_monitor.mark_dirty_all()` once per operation (including bulk reactivate/deactivate). - Keep `selectedContext` refreshed when contexts metadata updates so project badges/state do not go stale in push mode. - Replace ws-CSRF-flag logic with `validate_ws_origin()` (Origin/Referer normalized, host+port matched) and reject mismatches during Socket.IO `connect`. - Make `/csrf_token` GET-only again (remove WS preflight POST and ws_csrf TTL/expires payload). - Remove server-side CSRF expiry tracking + validation loop (`ConnectionInfo.csrf_expires_at`, `update_csrf_expiry`, `validate_session`). - Simplify frontend WS client: drop preflight + custom reconnect scheduler; re-enable Socket.IO built-in reconnection. - Update docs + tests to match Origin-based baseline (remove CSRF-preflight semantics/tests).
this was preventing correct shutdowns/restarts: pgjones/hypercorn#308
…gin Check - Backend: python/helpers/websocket.py sets requires_csrf = requires_auth, and run_ui.py enforces CSRF at Socket.IO connect. - Frontend: webui/js/api.js exports getCsrfToken(), and webui/js/websocket.js sends auth.csrf_token on connect/reconnect.
…checks, add status tooltip - Wire Socket.IO `cors_allowed_origins` to `validate_ws_origin()` so the transport enforces Origin allowlisting. - Harden `validate_ws_origin()` to accept reverse-proxy setups by checking Host + `X-Forwarded-Host`/`X-Forwarded-Proto` candidates and handling `SERVER_PORT` robustly. - Frontend: ensure CSRF is initialized before starting the Engine.IO handshake, and auto-retry connects after `connect_error` with exponential backoff (fixes long-lived tabs after restart). - Sync: always re-send `state_request` on every Socket.IO connect so per-sid projections are re-established (prevents “healthy but stalled” tabs after sleep/suspend). - UI: add native hover tooltip for the sync indicator and make hover reliable by applying the tooltip on the wrapper and disabling pointer events on the inner SVG.
- Make WebSocketManager namespace-aware (connection identity is (namespace, sid); per-namespace routing/buffers/watchers/handler registry). - Make WebSocketHandler namespace-aware; propagate namespace through emit_to/broadcast/request/request_all helpers. - Make StateMonitor + state_sync handler namespace-aware (projections and pushes keyed by (namespace, sid)). - Wire dynamic namespace registration in run_ui.py with per-namespace auth/CSRF/origin checks and deterministic UNKNOWN_NAMESPACE connect_error; reserve root "/" with deterministic NO_HANDLERS for request-style calls. - Add deterministic namespace discovery from python/websocket_handlers (file entries + folder entries, optional _handler suffix stripping, empty-folder warning, fail-fast invalid modules) and reserved root mapping (_default.py -> "/"). - Refactor frontend websocket client to namespaced instances (createNamespacedClient/getNamespacedClient); remove deprecated global surface (no broadcast/requestAll; no handler include/exclude options). - Migrate sync + developer stores to explicit namespaces (/state_sync, /dev_websocket_test). - Update/add tests for namespace isolation, root reserved behavior, namespace security, namespace discovery, unknown-namespace integration, and minimal client API surface; refresh existing websocket/state-sync tests accordingly. - Update websocket docs + 007-websocket-namespaces spec artifacts (tasks/quickstart/implementation log).
b0550c3 to
b773c69
Compare
…user bubbles - Preserve LogItem.id in persist_chat._deserialize_log() so user message GUIDs survive backend restarts. - Add regression test ensuring log ids roundtrip through serialize/deserialize.
…, runtimeInfo injection) - Review items addressed: A1-A20 (incl. the remaining A11/A13 abstraction + lint-only diff rollback) - Snapshot module: - Rename snapshot module to `python/helpers/state_snapshot.py` and keep `python/helpers/snapshot.py` as a compatibility re-export - Add `SnapshotV1` `TypedDict` and stricter schema/type validation for poll-shaped snapshots - State sync abstraction (A11/A13): - Add `StateRequestV1` + `parse_state_request_payload()` in `python/helpers/state_snapshot.py` as the single source of truth for `context/log_from/notifications_from/timezone` - Store a single `projection.request` in `python/helpers/state_monitor.py` and build snapshots via `build_snapshot_from_request()`, advancing cursors via `advance_state_request_after_snapshot()` - Update `python/websocket_handlers/state_sync_handler.py` to delegate parsing/validation to the snapshot module and call `update_projection(..., request=request, ...)` - Keep `/poll` compatible while routing defaults/normalization through the snapshot module (`python/api/poll.py`) - State sync request duplication (A12/A14): - Fix duplicate resync trigger on chat switch in `webui/index.js` (first snapshot after context switch no longer triggers a second `state_request`) - Strengthen request coalescing so “full” resync (lower offsets) wins over incremental (`webui/components/sync/sync-store.js`) - Locking simplification + atomic reads (A5/A6/A8): - Remove read-lock getter surfaces from `python/helpers/log.py` and `python/helpers/notification.py`, keep write locks and copy-under-lock for list iteration - Centralize state monitor “mark dirty” calls via `python/helpers/state_monitor_integration.py` and update call sites (chat/project/task/log/notification paths) - Revert `_truncate_value` typing back to the upstream-style `TypeVar` signature in `python/helpers/log.py` to eliminate lint-only diff churn - runtimeInfo propagation (A3/A4/A16-A20): - Inject `globalThis.runtimeInfo` in `webui/index.html` via `run_ui.py` placeholder replacement - Remove runtime dict propagation from `/csrf_token` + `/settings_get` and stop writing runtime globals from settings fetchers (`python/api/csrf_token.py`, `python/api/settings_get.py`, `webui/js/api.js`, `webui/js/settings.js`) - Gate dev websocket tools via `window.runtimeInfo?.isDevelopment`, use `x-create` for modal open, and remove the remaining event console error suppression (`webui/components/settings/developer/*`) - Tests: - Update/add coverage for the new state_request abstraction and keep snapshot parity/schema + namespace security green - Verified with focused suite: `63 passed`
…c progress updates - **Progress-bar correctness**: mark the state sync projection dirty when `Log.set_progress()` changes `progress` or `progress_active`, so progress-only transitions propagate even when no log items changed. - **No push storms**: change-detect progress updates and use `mark_dirty_for_context` (context-scoped), avoiding `mark_dirty_all` fan-out. - **Performance**: replace repeated hot-path imports in `python/helpers/log.py` with lazy-cached callables to reduce overhead during streaming/log updates. - **Edge case fix**: when post-monologue background memorization finishes and the chat is idle, reset the progress bar to "Waiting for input" to prevent it sticking on "Preloading knowledge...".
|
All discussion items got adressed, the duplicate event emission should now also be properly fixed. |
Check for owner before reset, do not touch if owner_no != progress_no
WebSocket support and infrastructure
Implement WebSocket singleton lifecycle, diagnostics, and developer console
Offload handler dispatch + lifecycle hooks via DeferredTask worker; marshal socketio emit/disconnect back to dispatcher loop (FR-DISPATCH-001)
Fix validate_session to avoid awaiting disconnect while holding sync lock (MIG-010)
Enforce client -> server {ts,data} payload wrapper; unwrap on server while preserving correlationId (FR-ENVELOPE-UNIFORM)
Raise ConnectionNotFoundError for unknown sids; update diagnostics watcher emission accordingly (FR-045)
Remove eager WS auto-connect and dev harness bootstrap import; add proper harness/event console subscribe detach (FR-035, NFR-DIAG-001, FR-DIAG-001)
feat(ws): replace /poll sync with WebSocket state_push; migrate UI server to hypercorn
applySnapshot()+buildStateRequestPayload()so both/pollandstate_pushdrive the exact same DOM propagation (webui/index.js)python/helpers/snapshot.py), and refactor/pollto use it (python/api/poll.py)StateMonitordirty tracking + coalescedstate_pushemission (25ms window) (python/helpers/state_monitor.py) plusStateSyncHandlerstate_requesthandshake + gating (python/websocket_handlers/state_sync_handler.py)python/helpers/log.py,python/helpers/notification.py,python/helpers/task_scheduler.py,python/api/chat_create.py,python/api/chat_reset.py,python/api/chat_remove.py)threading.RLockguards and accessor APIs forAgentContext._contexts,Log, andNotificationManager; serialize logs under lock to avoid concurrent mutation (agent.py,python/helpers/log.py,python/helpers/notification.py,python/helpers/persist_chat.py,python/api/api_log_get.py,python/api/notifications_*)/csrf_tokenfetches and addinvalidateCsrfToken()(webui/js/api.js), cache + proactively refresh WS preflight, disable Socket.IO auto-reconnect, and run preflight-gated reconnect with backoff (webui/js/websocket.js)/csrf_tokenPOST via global manager access (python/api/csrf_token.py,python/helpers/websocket_manager.py)webui/components/sync/sync-store.js,webui/components/sync/sync-status.html,webui/components/chat/top-section/chat-top.html)state_requeston context switch (webui/index.js,webui/components/sidebar/chats/chats-store.js)/restart, remove/healthpolling loop, and rely on WS reconnect (webui/components/sidebar/chats/chats-store.js)webui/components/settings/developer/websocket-event-console-*,docs/websocket-infrastructure.md); Harness adds state sync/no-poll, context switching, degraded fallback, resync trigger tests with PASS/FAIL logging and safer toast handling (webui/components/settings/developer/websocket-test-store.js,webui/components/settings/developer/websocket-tester.html,webui/components/notifications/notification-store.js)tests/test_snapshot_*,tests/test_state_*,tests/test_multi_tab_isolation.py,tests/test_websocket_csrf.py,tests/test_websocket_handlers.py)--ssl+ dev CA/cert generation + loopback/ssl/ca.pemdownload) and update deps (run_ui.py,run_ui_ssl.py,python/helpers/runtime.py,requirements.txt)webui/vendor/ace-min/ace.min.css,webui/public/dev_testing.svg)