RnD sync — April 28 to May 5 #861

Open
testradav wants to merge 731 commits into security-union:RnD from testradav:RnD
testradav (Collaborator) commented Apr 28, 2026

May 5

Encoder recovery & observability

  • feat: encoder auto-recovery on closed-codec / VPX memory errors — when VideoEncoder.encode() or .configure() throws InvalidStateError: closed codec or a VPX Memory allocation error, the encoder now tears itself down and restarts automatically, no user intervention (camera toggle) required. Camera encoder reacquires getUserMedia and cleans up partially-initialized capture resources before retrying; screen encoder reacquires a fresh getDisplayMedia stream and re-emits a Started event so the UI rebinds to the replacement stream. Exponential backoff (500ms × min(attempt, 4), max 5 restarts before surfacing Failed); restart counter resets on the first successful encode so transient errors don't accumulate across long-lived sessions. Adds a CodecState::Closed guard before every configure() call. Unit tests lock in the fatal-vs-non-fatal classification predicates and the screen-capture-reacquisition decision.

  • feat: encoder error counters and frames-emitted in HealthPacket — adds 10 new optional uint64 fields to HealthPacket tracking encoder errors by class (closed_codec, vpx_mem_alloc, configure_fatal, generic) plus frames_submitted_ok for both camera and screen encoders. Closes a Prometheus blind spot: encoder_output_fps could show steady 30 fps target while both encoders were dead. A new shared classify_encode_error(msg) -> EncodeErrorBucket helper routes errors identically across both encoders. Counters are non-zero-guarded so clean-state packets stay small.

  • feat: distinguish handshake failures from session drops in connection loss — adds ConnectionLostReason::HandshakeFailed / SessionDropped and threads it through the full on_connection_lost callback chain. Log lines now read [HANDSHAKE FAILED] or [SESSION DROPPED] instead of a single undifferentiated message. Two new health-packet counters. Transport-layer handshake_complete flag set when ready() / onopen resolves; the close handlers consult it to classify. Triple-promise fired guard for WebTransport covers ready.catch / closed.then / closed.catch; an analogous ws_fired guard for WebSocket covers the common Close+Error double-fire. Active-connection guard placed BEFORE counter increments so election-probe failures to losing servers don't inflate metrics.

  • fix: restore camera encoder shared adaptive-quality metrics — the encoder-loop rewrite that introduced auto-recovery inadvertently deleted 12 of 14 public accessor methods on CameraEncoder (shared video / audio tier index, encoder output fps, fps ratio, worst-peer fps, bitrate ratio, target bitrate kbps, tier transitions, climb-limiter snapshot, dwell samples, re-election completed signal). The encoder built clean in isolation (no internal callers) but the UI consumes all twelve, so the workspace cargo check failed at the rollup gate. This change restores both the API surface and the backing shared state + control-loop wiring (12 new struct fields, encoder loop writes to fps_ratio / worst_peer_fps / bitrate_ratio / target_bitrate_kbps every diagnostics tick, tier-transition draining, re-election signal swap(false, AcqRel) consume, climb-limiter snapshot updates, dwell-sample draining, WS self-congestion sliding-window check).
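
The recovery rules from the first two bullets can be sketched as below. This is a minimal illustration, not the PR's code: the substring matching inside `classify_encode_error`, the `is_fatal` predicate, and the `RestartPolicy` type are assumptions, and the `ConfigureFatal` bucket is only declared because the sketch does not model how configure-time failures are detected. Note that the delay formula as written in the description, 500ms × min(attempt, 4), grows linearly up to a 2000ms cap.

```rust
/// Error classes named by the HealthPacket counters.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EncodeErrorBucket {
    ClosedCodec,
    VpxMemAlloc,
    /// Declared for completeness; assumed to be assigned at the configure() call site.
    ConfigureFatal,
    Generic,
}

/// Hypothetical substring matching; the real helper may use different rules.
pub fn classify_encode_error(msg: &str) -> EncodeErrorBucket {
    let lower = msg.to_ascii_lowercase();
    if lower.contains("closed codec") {
        EncodeErrorBucket::ClosedCodec
    } else if lower.contains("memory allocation error") {
        EncodeErrorBucket::VpxMemAlloc
    } else {
        EncodeErrorBucket::Generic
    }
}

/// Assumption: every bucket except Generic triggers a teardown and restart.
pub fn is_fatal(bucket: EncodeErrorBucket) -> bool {
    !matches!(bucket, EncodeErrorBucket::Generic)
}

pub const MAX_RESTARTS: u32 = 5;

/// Tracks restart attempts for one encoder.
#[derive(Debug, Default)]
pub struct RestartPolicy {
    attempts: u32,
}

impl RestartPolicy {
    /// Returns the delay before the next restart, or None once the
    /// budget of MAX_RESTARTS is spent (the caller would surface Failed).
    pub fn next_delay_ms(&mut self) -> Option<u64> {
        if self.attempts >= MAX_RESTARTS {
            return None;
        }
        self.attempts += 1;
        Some(500 * u64::from(self.attempts.min(4)))
    }

    /// Called on the first successful encode after a restart.
    pub fn on_successful_encode(&mut self) {
        self.attempts = 0;
    }
}
```

Under these assumptions, five consecutive failures wait 500, 1000, 1500, 2000, and 2000 ms, and the sixth returns `None`.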

Re-election & connection-preservation hardening

  • fix: treat sustained implausible RTT as a re-election trigger — the existing RTT_SANITY_MAX_MS plausibility filter silently drops anomalous RTT measurements (e.g., a 2.6× clock-rate bug in the WebSocket heartbeat path), but the elevated-RTT degradation detector relies on accepted samples — so a streak of all-implausible measurements would never trigger re-election. Recent incident data showed 255 consecutive implausible discards over 6 minutes with no re-election ever firing. This change adds a per-connection consecutive_implausible_discards counter; the 1Hz watchdog consults the counter on the active connection and fires re-election when the streak exceeds REELECTION_IMPLAUSIBLE_DISCARDS_THRESHOLD (10 — chosen for ~10s wall-clock, riding out one-shot late ACKs without thrashing). Defense-in-depth alongside the root-cause clock-drift fix; catches any future class of clock/time-base brokenness.

  • feat: post-rebase re-election retry when candidates were unavailable — when degradation triggers a rebase but only one server is configured (a transient relay outage or a deliberate WebTransportOnly / WebSocketOnly user pref), today's behavior is "rebase RTT baseline and never retry" — so a brief relay outage strands the user on the rebased connection forever. This change adds a 30s retry timer that re-evaluates candidate availability after each rebase and triggers re-election when the URL list grew (e.g., a fresh room-token refresh added back the alternate transport). Retry budget capped at 3 attempts (~90s wall-clock). The retry counter resets on a full reconnect or a successful election. The user transport preference is now plumbed through VideoCallClientOptions → ConnectionManagerOptions as allow_post_rebase_retry: bool, so deliberate WebTransportOnly / WebSocketOnly selections never trigger an auto-recovery that overrides the user's choice.

  • fix: re-election preserves old connection on total candidate failure — when automatic re-election fires and ALL new candidates fail their handshake before producing valid RTT samples, today's election-deadline path force-disconnects the user even though the original active connection is often still receiving media + heartbeat. Recent incident: a healthy WebTransport (69ms baseline RTT), 5-sample RTT spike triggered re-election, both new candidates failed handshake within 14ms of each other (relay was briefly unreachable), election deadline expired — user was kicked off a still-live connection and was absent 16 minutes. This change tracks last_inbound_at_ms per connection and, in the election-failure branch, runs a new try_preserve_old_connection_on_candidate_failure helper that restores the old connection + RTT measurement when its inbound freshness is ≤ 5s, schedules a 30s retry, and emits Connected (NOT Failed). A reelection_preserved_once guard prevents pinning a genuinely dead connection indefinitely — if the 30s retry's election ALSO fails, the guard forces the disconnect path.
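
The preserve-or-disconnect decision in the last bullet might look roughly like this. It is a sketch under stated assumptions: the function name `on_all_candidates_failed`, the `ElectionFailureAction` enum, and passing the `reelection_preserved_once` guard as a `&mut bool` are all hypothetical; only the 5s freshness window, the 30s retry, and the one-shot guard come from the description. When the guard is cleared again is not modelled.

```rust
/// Freshness window from the description: inbound traffic within 5 s.
pub const PRESERVE_FRESHNESS_MS: u64 = 5_000;
/// Delay before the follow-up election attempt.
pub const PRESERVE_RETRY_MS: u64 = 30_000;

#[derive(Debug, PartialEq, Eq)]
pub enum ElectionFailureAction {
    /// Keep the old connection, emit Connected, and retry later.
    PreserveOld { retry_in_ms: u64 },
    /// Fall through to the existing force-disconnect path.
    Disconnect,
}

/// Decides what to do when every new candidate failed its handshake.
/// `preserved_once` stands in for the reelection_preserved_once guard.
pub fn on_all_candidates_failed(
    now_ms: u64,
    last_inbound_at_ms: Option<u64>,
    preserved_once: &mut bool,
) -> ElectionFailureAction {
    let fresh = last_inbound_at_ms
        .map(|t| now_ms.saturating_sub(t) <= PRESERVE_FRESHNESS_MS)
        .unwrap_or(false);
    if fresh && !*preserved_once {
        *preserved_once = true;
        ElectionFailureAction::PreserveOld { retry_in_ms: PRESERVE_RETRY_MS }
    } else {
        ElectionFailureAction::Disconnect
    }
}
```

The guard means a second consecutive failure disconnects even if the old connection still looks fresh, which matches the stated goal of not pinning a dead connection indefinitely.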

UI polish & dev hygiene

  • feat: token-based styling system replacing hardcoded values — three-tier token system: dioxus-ui/static/tokens-v0.json (frozen contract, drift-checked) → global.css :root block (CSS source of truth) → dioxus-ui/src/theme.rs (Rust-side constants for SVG/charts). Replaces hardcoded color/spacing/effects values across style.css, signal_quality.rs, neteq_chart.rs, search_modal.rs, meeting.rs, attendants.rs, appearance_settings_panel.rs, guest_join.rs, and routing.rs with var(--token-name) / theme::TOKEN references. Two drift-check scripts (scripts/check-token-drift.sh and scripts/check-hardcoded-colors.sh) gate against regressions, plus a new CI workflow that fires on every PR. Sets a foundation for future theming support; ships with a 220-line architecture doc at dioxus-ui/docs/styling-tokens.md.

  • fix: crop button state persists across new-peer joins — crop-button state moved from DOM class_list.add/remove manipulation to a Dioxus CroppedTilesCtx: Signal<HashMap<String, bool>> provided at AttendantsComponent scope, surviving tile remounts and layout switches (regular grid → split screen-share → full-bleed). on_peer_removed prunes both the peer_id and screen-share-{peer_id} keys to prevent map growth across long meetings. Crop button is hidden when the peer's video is disabled; shows a green active state matching the mic icon. 20 new unit tests in canvas_generator.rs (4 crop-specific: toggle round-trip, cleanup-on-removal, missing-ID default-false, None-context default-false).

  • chore: ALLOW_ANONYMOUS now defaults to false for local development — flips the env-var fallback in docker/docker-compose.yaml and start_dev.sh from true to false, aligning local dev with production's already-secure default. Production unaffected; E2E unaffected (docker-compose.e2e.yaml doesn't set the var either, and E2E tests bypass auth via JWT cookie injection). New devs running docker-compose up / ./start_dev.sh without OAuth setup will need to explicitly export ALLOW_ANONYMOUS=true to keep the unauthenticated fallback.

  • feat: per-peer transport (WT/WS) badge in diagnostics & signal popups — each client now stamps its active transport into the periodic HeartbeatMetadata proto (new TransportType enum, default 0 = unknown for forward/backward compat). Receivers track the latest value per peer and forward it through the existing peer_status diagnostics event as a peer_transport text metric. Two UI surfaces consume it: the diagnostics popup's Per-Peer Summary and the signal-quality popup header — both render a compact WT / WS / em-dash pill per peer. Wire cost is ~2 bytes per heartbeat when set; 0 bytes for default/unknown via proto3 default-elision. Receive path uses enum_value().unwrap_or(...) so future enum variants from newer clients don't panic. Signal writes are gated on actual change so heartbeats don't trigger UI re-renders. Two new Playwright E2E specs cover both popup surfaces.
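
For the crop-button fix, the state handling reduces to a small keyed map. The sketch below uses a plain struct instead of a Dioxus `Signal`, and the `CroppedTiles` type and its method names are hypothetical; the `screen-share-{peer_id}` key format and the default-false behaviour follow the description.

```rust
use std::collections::HashMap;

/// Plain-Rust stand-in for the Signal<HashMap<String, bool>> context.
#[derive(Debug, Default)]
pub struct CroppedTiles {
    map: HashMap<String, bool>,
}

impl CroppedTiles {
    /// Flips the crop state for a tile and returns the new value.
    pub fn toggle(&mut self, tile_id: &str) -> bool {
        let entry = self.map.entry(tile_id.to_string()).or_insert(false);
        *entry = !*entry;
        *entry
    }

    /// Unknown tile IDs default to "not cropped".
    pub fn is_cropped(&self, tile_id: &str) -> bool {
        self.map.get(tile_id).copied().unwrap_or(false)
    }

    /// Prunes both the camera tile key and the screen-share tile key
    /// so the map does not grow across long meetings.
    pub fn on_peer_removed(&mut self, peer_id: &str) {
        self.map.remove(peer_id);
        self.map.remove(&format!("screen-share-{peer_id}"));
    }
}
```

Because the state lives outside the tile component, a remount only needs to call `is_cropped` again to recover the button state.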

May 1

In-meeting settings

  • fix: avoid stale microphone and speaker list in the in-meeting settings modal — opening the gear icon during a call could surface an outdated device list because enumeration only ran on initial load or in response to a hardware-change event. The fix adds a safe re-enumeration that runs once each time the modal opens: device lists update and the current selection is re-validated (kept if still present in the new list, falls back to the first device otherwise) without emitting any of the on_loaded / on_audio_selected / on_video_selected / on_audio_output_selected callbacks — so no encoder restart or speaker switch fires as a side-effect of opening settings. New Playwright spec joins a meeting, captures mic/cam button states, opens and closes the modal, and asserts the states are unchanged.
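
The re-validation rule (keep the current selection if it is still present, otherwise fall back to the first device) can be expressed as a pure function. The name `revalidate_selection` and the use of plain device-ID strings are assumptions for illustration; the real code works with browser device descriptors.

```rust
/// Returns the device that should be selected after a re-enumeration.
/// Keeps `current` when it still exists, else falls back to the first
/// device, else None when the list is empty.
pub fn revalidate_selection(current: Option<&str>, device_ids: &[String]) -> Option<String> {
    match current {
        Some(id) if device_ids.iter().any(|d| d.as_str() == id) => Some(id.to_string()),
        _ => device_ids.first().cloned(),
    }
}
```

Keeping this step free of side effects is what allows the modal to refresh its lists without firing any of the selection callbacks.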

Home page UX

  • feat: unified meetings list with ownership-aware UI gating — replaces the separate "Previously Joined" and "My Meetings" sections on the home page with a single unified meetings list backed by a new GET /api/v1/meetings/feed endpoint. Each row carries an "Owner" badge when the authenticated user owns the meeting, and edit / delete / end-meeting actions gate on a server-computed is_owner flag — the only authoritative ownership signal in the response. The mutating handlers continue to enforce creator_id == authenticated_user_id independently, so the UI flag is cosmetic; the server is authoritative.

  • perf: meetings feed folds participant counts into a single SQL round-trip — the new db_meetings::list_feed_for_user query computes participant_count and waiting_count per row via LEFT JOIN LATERAL subqueries, eliminating the prior 1+2N per-row pattern. Ships a partial-composite-index migration idx_meeting_participants_meeting_id_admitted / _waiting (partial on meeting_id filtered by status) so each count resolves via index range scan instead of heap fetches under load. The user-side dedup is via the lateral aggregate (MAX(admitted_at)) — a meeting the user both owns and was admitted into appears once. Limit is hard-capped at 200 rows server-side; negative values rejected with 400 INVALID_INPUT.

  • test: three-layer coverage for the new feed and ownership-gating — backend integration tests (meeting-api/tests/list_feed_tests.rs) covering owner-only, admitted-only, mixed-deduplication, and pagination boundary; Dioxus unit tests (dioxus-ui/tests/meetings_list_owner_gating.rs) asserting the owner badge and management actions render only on is_owner=true rows; Playwright spec (e2e/tests/meetings-ownership.spec.ts) for the ownership UI under realistic auth.
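
The limit handling described for the feed endpoint (hard cap of 200, negative values rejected) could be sketched as follows. The default of 50 for a missing `limit`, the `InvalidInput` type, and the function name are hypothetical; the description does not say what happens when the parameter is absent or zero, so the sketch passes zero through unchanged.

```rust
/// Server-side hard cap from the description.
pub const FEED_LIMIT_MAX: i64 = 200;
/// Hypothetical default when the client sends no limit.
pub const FEED_LIMIT_DEFAULT: i64 = 50;

/// Stand-in for the handler's 400 INVALID_INPUT response.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidInput(pub &'static str);

/// Resolves the effective row limit for the meetings feed.
pub fn resolve_feed_limit(requested: Option<i64>) -> Result<i64, InvalidInput> {
    match requested {
        None => Ok(FEED_LIMIT_DEFAULT),
        Some(n) if n < 0 => Err(InvalidInput("limit must not be negative")),
        Some(n) => Ok(n.min(FEED_LIMIT_MAX)),
    }
}
```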

April 29

Sign-out reliability

  • fix: sign-out leaving the user in an anonymous state with no way to re-sign-in — adds an anonymous-profile filter (!user_id.starts_with("anon-")) to the data-load path on both the home page and the meeting page so a stale guest-session profile no longer enters the user_profile signal after sign-out. The home-page logout flow was rewritten to clear local state synchronously, navigate via the SPA router, and fire-and-forget a background fetch to /logout for backend session-cookie invalidation — avoiding the OIDC end_session_endpoint redirect chain that was identified as the root cause of the "blank /logout screen" + "stuck anonymous" symptom. Includes an in-code DESIGN NOTE documenting the trade-off (the IdP session may stay alive after sign-out) and the long-term recommendation to drive the redirect chain from the backend after the SPA has already navigated away. The auth fast-path check_session() no longer bails out on a stale vc_guest_session_id marker when the deployment uses server-side OAuth — clears the marker and falls through to the network check.

Video quality & transport

  • feat: adaptive initial-tier selection for screen share — replaces the fixed "always start at high (1080p / 2500 kbps)" warm-start with a pure decision function initial_screen_tier(rtt_ms, camera_tier_index) that picks high / medium / low based on the network signals at the moment screen-sharing starts. RTT ≥ 400ms drops to low; RTT ≥ 200ms or camera already at sd/below drops to medium; otherwise high. Cold-start (no signals) keeps the existing optimistic high-tier default and lets the PID loop adapt. The chosen tier is applied to the encoder atomics before the encoding loop starts, so the very first encoded frame is at the right bitrate — readable text on constrained presenter uplinks without waiting for the PID to ramp down. The screen-share env-var default SCREEN_BITRATE_KBPS was bumped from 100 (below the low-tier min_bitrate_kbps = 250) to 1200 (matches medium-tier ideal_bitrate_kbps).

  • docs: comprehensive screen-share capacity-planning guide — docs/server-sizing-guide.md gains a new "Screen Share Bandwidth" section covering per-tier bitrate cost (high / medium / low), per-N relay-egress multiplier tables, mixed-mode scenarios (camera + screen + audio), VBR overshoot analysis (analytical 1.5–3× burst above ideal_bitrate_kbps during scroll-heavy frames), three mitigation options (CBR for low tier, halving ideal, or NIC-headroom-only), and a precise chrome://webrtc-internals measurement protocol with a "pending measurements" table for empirical validation. docs/Monitoring_Production.md gains a "Screen Share Egress: Operator Callout" section with a Prometheus query (rate(relay_room_bytes_total{direction="outbound"}[1m])), a RoomEgressHigh alert recommendation, and a 67%-fan-out rule of thumb for a 20-person meeting at high tier.
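
The tier-selection thresholds from the adaptive initial-tier bullet translate into a small pure function. This is a sketch, assuming that both inputs are optional, that camera tier indices grow as quality drops, and that index 2 corresponds to "sd"; `CAMERA_SD_TIER_INDEX` and the `ScreenTier` enum are hypothetical names, and the real signature may differ.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScreenTier {
    High,
    Medium,
    Low,
}

/// Hypothetical: camera tier indices increase as quality drops, and 2 is "sd".
pub const CAMERA_SD_TIER_INDEX: usize = 2;

/// Picks the starting screen-share tier from the signals available
/// at the moment sharing begins. With no signals (cold start) it
/// keeps the optimistic High default.
pub fn initial_screen_tier(rtt_ms: Option<f64>, camera_tier_index: Option<usize>) -> ScreenTier {
    let rtt = rtt_ms.unwrap_or(0.0);
    let camera_degraded = camera_tier_index.map_or(false, |i| i >= CAMERA_SD_TIER_INDEX);
    if rtt >= 400.0 {
        ScreenTier::Low
    } else if rtt >= 200.0 || camera_degraded {
        ScreenTier::Medium
    } else {
        ScreenTier::High
    }
}
```

Keeping the decision pure makes the boundary values (exactly 200ms, exactly 400ms) straightforward to pin down in unit tests.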

Home page UX

  • feat: home page meeting-list polish and tooltip-driven density — My Meetings and Previously Joined rows are now single-line: meeting ID + state pill on the left, edit/delete on the right. All the metadata that previously cluttered each row — duration, time range, attendees, password, etc. — moved into a body-level tooltip portal that escapes every overflow: hidden / overflow-y: auto ancestor, positions itself at the cursor with edge detection (flips when near the right or bottom edge), fades in over 120ms, and respects prefers-reduced-motion. State pills are now title-case (Active / Idle / Ended) with subtle borders and lower-opacity backgrounds; the Owner badge gets the same softer treatment. Per-state tooltip layouts: Active shows Created-on / Started-on / Duration / Attendees / Waiting / Password; Idle shows Created-on / Last-active-on / Password; Ended shows Created-on / Last-active-on / Duration / Password; Previously-Joined shows the same as My Meetings, with Created on shown only when the user owns the meeting.

  • feat: format_duration and format_datetime helpers — format_duration now handles multi-day durations (1d 1h 1m 1s instead of overflowing the hours field). New format_datetime helper renders dates as Apr 28, 3:07 PM for the tooltip rows. Eight new host-target unit tests cover boundary cases (zero, sub-minute, minutes-and-seconds, hours-no-seconds, just-under-24h, exactly-24h, multi-day, exactly-48h).

  • fix: meeting-list timestamps in milliseconds (was inconsistent seconds) — list_meetings and create_meeting API handlers were emitting created_at / started_at / ended_at in Unix seconds via .timestamp() while the rest of the meeting-API and the frontend treat these as milliseconds. Result: ended meetings rendered with identical start/end times and a duration of 0s. Both handlers now emit ms via .timestamp_millis(), matching the already-correct get_meeting / end_meeting. JoinedMeetingSummary carries a new created_at: i64 (ms) so the Previously-Joined tooltip can render Created-on for owned meetings. Test coverage in meeting_crud_tests.rs locks in the new ms-magnitude bound (MS_LOWER_BOUND = 1_000_000_000_000) for test_list_meetings_success and a new test_list_meetings_returns_ended_at_in_milliseconds that exercises an idle → active → ended cycle.

  • fix: activate() now refreshes timestamps when reactivating an idle or ended meeting — surfaced during the home-page polish review. activate() previously only updated state = 'active', leaving started_at at the original-creation time and ended_at stale. New SQL refreshes started_at = NOW() and clears ended_at on idle → active and ended → active transitions while staying idempotent on active → active (so other attendees joining don't bump timestamps). New meeting-api/tests/activate_semantics_tests.rs locks in the refresh-on-reactivation, the active → active byte-identical idempotency, the started_at >= created_at invariant, and created_at immutability across reactivation.
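
The multi-day behaviour of `format_duration` can be illustrated with a short sketch. Only the `1d 1h 1m 1s` shape comes from the description; taking whole seconds as input and printing every unit below the largest non-zero one (for example `1h 0m 0s`) are assumptions, and the shipped helper may elide zero fields differently.

```rust
/// Formats a duration given in whole seconds, starting at the largest
/// non-zero unit so multi-day values no longer overflow the hours field.
pub fn format_duration(total_secs: u64) -> String {
    let d = total_secs / 86_400;
    let h = (total_secs % 86_400) / 3_600;
    let m = (total_secs % 3_600) / 60;
    let s = total_secs % 60;
    if d > 0 {
        format!("{d}d {h}h {m}m {s}s")
    } else if h > 0 {
        format!("{h}h {m}m {s}s")
    } else if m > 0 {
        format!("{m}m {s}s")
    } else {
        format!("{s}s")
    }
}
```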

Diagnostics

  • fix: refine per-peer diagnostics panels — targeted UX improvements to the SignalQualityPopup: (1) legend help text bumped from 6px to 8pt (was sub-pixel on most displays); (2) all open peer popups scroll together — scrolling any chart moves all others to the same time slot; (3) unchecking a metric in the legend also hides it from the hover tooltip; (4) "Latency" renamed to "Server RTT" throughout with help text clarifying it is client-to-server and identical for all peers in a session; (5) faint vertical grid lines at every 10-second mark on the chart; (6) RTT polyline made more subtle (40% opacity, thinner stroke, sparser dash) so it recedes behind the quality lines.

  • feat: redesigned non-owner pre-join card — prominent h2 heading, monospace meeting-ID pill chip with ellipsis-on-overflow, clean subtitle, and a glass divider — replaces the prior generic "Ready to join the meeting?" text.

Test coverage

  • test: Playwright E2E coverage for the four guest user-flows — three new Playwright spec files (guest-leave.spec.ts, guest-rejection.spec.ts, guest-waiting-room.spec.ts) covering: guest-leave (host sees the tile removed within the grace period); guest-rejection (UI surface plus an API-guard assertion that the rejected guest's DB record carries status=rejected, not just that the UI shows "Entry denied" — closes a class of regression where the UI updates but the DB row is wrong); sequential waiting-room admission with multiple queued guests; and an admitted_can_admit=true regression net that locks in the live-sync auth path (admitted non-host can admit a queued guest). Multi-browser orchestration consistent with speaker-highlight.spec.ts — separate chromium.launch() per participant, per-context auth cookie injection, try/finally cleanup, and Date.now() meeting IDs to avoid parallel-run collisions.

  • test: Playwright coverage for OAuth-fallback display-name handling — new auth-display-name.spec.ts covers the three non-OAuth fallback scenarios (empty input on initial load with localStorage clear, localStorage restore on page load, direct-navigation pre-fill of meeting-page input from localStorage). Documents the OAuth-specific paths that can't be exercised under ENABLE_OAUTH=false as a known coverage gap.

April 28

Security & access control

  • feat: per-user rate limiting and NATS packet sanitization for display-name changes — 5/min/user rate limit on display-name changes with a shared budget between rename and rejoin to close the leave-and-rejoin bypass; NATS forwarding sanitizes UTF-8, validates display names, and rewrites room_id authoritatively before publishing; static error messages prevent reflected-input info disclosure; client-side revalidation of server-sent name changes.

  • feat: guest user flow with waiting room support — guests join via a guest:{uuid} user_id, bypass OAuth, and enter through the waiting room. Includes DB migration, is_guest JWT claim, and UI integration.

  • feat: meeting ends for all participants when host leaves — adds an end_on_host_leave meeting toggle; when the host disconnects with it on, MEETING_ENDED is broadcast and the host tile cleans up before the overlay renders. Also closes a TOCTOU window for late joiners in the join_attendee transaction.

  • fix: admitted_can_admit now syncs live via NATS settings events — toggling host permissions mid-meeting now takes effect for already-joined participants. Adds a MEETING_SETTINGS_UPDATED protobuf event; the client refetches status on receipt and updates host-control visibility reactively. Unauthenticated guests receive a read-only observer token (meeting-bound, 30-min TTL) for the refetch.
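
The shared rename/rejoin budget from the rate-limiting bullet can be sketched with a per-user sliding window. The windowing strategy is an assumption (the description only gives 5/min/user), and `DisplayNameRateLimiter` and `try_acquire` are hypothetical names. Sharing the budget simply means both the rename path and the rejoin path call the same method with the same `user_id`.

```rust
use std::collections::{HashMap, VecDeque};

pub const MAX_CHANGES_PER_WINDOW: usize = 5;
pub const WINDOW_MS: u64 = 60_000;

/// Per-user sliding-window limiter for display-name changes.
#[derive(Debug, Default)]
pub struct DisplayNameRateLimiter {
    events: HashMap<String, VecDeque<u64>>,
}

impl DisplayNameRateLimiter {
    /// Records the change and returns true when the user is within
    /// budget; returns false (recording nothing) when the budget is spent.
    pub fn try_acquire(&mut self, user_id: &str, now_ms: u64) -> bool {
        let q = self.events.entry(user_id.to_string()).or_default();
        // Drop timestamps that have aged out of the window.
        while q.front().map_or(false, |&t| now_ms.saturating_sub(t) >= WINDOW_MS) {
            q.pop_front();
        }
        if q.len() >= MAX_CHANGES_PER_WINDOW {
            return false;
        }
        q.push_back(now_ms);
        true
    }
}
```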

Video quality & transport

  • perf: adaptive quality now works for WebSocket-only users — ports the WebTransport bounded-channel + CONGESTION-via-NATS pattern to WebSocket, plus client-side self-detection of WS backpressure that force-steps-down video quality on local drops.

  • fix: PID integrator stuck at maximum with no recovery — adds a 30-second saturation watchdog that force-resets the integrator when pid_output sits at PID_OUTPUT_MAX too long, breaking the feedback loop where reduced bitrate caused low received FPS which kept the PID saturated indefinitely.

  • perf: content sharing starts at midpoint quality tier — screen-share now starts at the midpoint tier (720p / 8fps / 1200 kbps) instead of the maximum. The PID controller adapts in either direction — stepping up to 1080p when bandwidth is plentiful, or stepping down under congestion. contentHint = 'detail' applied to the MediaStreamTrack for better codec behaviour on text and fine detail.

  • fix: upgrade web-transport-quinn 0.8.1 → 0.11.9; fix WT inbound reception — fixes the accept_uni lost-waker bug that silently dropped inbound WebTransport streams. Corrects the inbound-read pattern from read_to_end on persistent streams to an explicit length-prefixed frame loop. Validated by a 6-bot 50/50 WT/WS load test.

  • fix: join_meeting no longer panics on unexpected state — eliminates a panic code path in join_meeting by returning an appropriate error result instead.
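
The PID saturation watchdog described above can be sketched as a small state machine. The value of `PID_OUTPUT_MAX`, the `SaturationWatchdog` type, and the millisecond clock parameter are assumptions; only the 30-second timeout and the force-reset behaviour come from the description.

```rust
/// Hypothetical ceiling; the real PID_OUTPUT_MAX value is not given here.
pub const PID_OUTPUT_MAX: f64 = 1.0;
pub const SATURATION_TIMEOUT_MS: u64 = 30_000;

/// Tracks how long the PID output has been pinned at its maximum.
#[derive(Debug, Default)]
pub struct SaturationWatchdog {
    saturated_since_ms: Option<u64>,
}

impl SaturationWatchdog {
    /// Feed every PID output sample. Returns true when the caller
    /// should force-reset the integrator.
    pub fn observe(&mut self, pid_output: f64, now_ms: u64) -> bool {
        if pid_output < PID_OUTPUT_MAX {
            self.saturated_since_ms = None;
            return false;
        }
        let since = *self.saturated_since_ms.get_or_insert(now_ms);
        if now_ms.saturating_sub(since) >= SATURATION_TIMEOUT_MS {
            // Restart the timer so the next reset needs another full timeout.
            self.saturated_since_ms = None;
            true
        } else {
            false
        }
    }
}
```

Any sample below the ceiling clears the timer, so only uninterrupted saturation triggers a reset.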

Participant grid & screen share

  • feat: stable join-time tile ordering with overflow-speaker promotion — tile sort is now deterministic by join time instead of speech activity, eliminating grid shuffling as people take turns speaking. Loud speakers in the overflow set swap into the visible set by displacing least-recently-active peers; density mode escalates automatically so every active speaker stays on-screen.

  • feat: resizable share screen area — draggable vertical handle (clamped 30–85%) between the shared content panel and the participants panel; tile grid collapses to 1 column when the right panel is narrow; stopping a share restores the normal grid without layout artifacts.

  • fix: new participants see active screen shares immediately — on join, a keyframe request (PLI) is fired on visibility transition so shared content renders immediately instead of staying blank for late joiners.

  • fix: improved tile grid layout during screen share — better tile placement and density when a screen share is active; resolves a layout collision between tile-ordering and screen-share code paths.

Settings & meeting controls

  • feat: meeting controls bar collapses and docks — control bar behaves like a Mac dock: primary buttons (mic/cam) always visible; secondary buttons (screen-share, peer-list, diagnostics, settings) collapse after 1 second of inactivity; the bar fades to 10% opacity after 4 seconds and reveals on hover or tap. User can dock the bar at bottom, left, or right.

  • feat: Appearance tab in Settings — speaker-highlight glow customization — each participant can choose glow color (preset palette or custom hex), tune outer/inner intensity, or disable the effect entirely. Debounced localStorage writes cancel on navigation to avoid write races. Shared calculation helper eliminates formula duplication between runtime and settings preview.

  • feat: sticky protocol selector with segmented control in Network settings — replaces the dropdown + immediate confirm() dialog with a three-pill segmented control (Auto · WebTransport · WebSocket) and a deferred Apply button. A "Remember protocol choice" toggle (off by default) writes to localStorage for persistence across restarts or sessionStorage for the current tab only. Selecting Auto always clears both stores.

Home page UX

  • feat: home page input validation and meeting creation overhaul — Create a New Meeting renamed to Generate a New Meeting ID; the button renders exclusively with Start or Join Meeting (the two never co-exist — empty field shows Generate, filled field shows Start/Join). Generate populates the meeting-id field rather than navigating directly; the user clicks Start/Join to enter. Per-keystroke inline validation shows only when an invalid character is typed, reusing the canonical is_allowed_display_name_char predicate so client and server stay in sync. Info-icon tooltips provide on-demand allowed-character guidance with full keyboard/Escape/outside-click dismissal. Browser tab title cleaned up to videocall.rs.

  • fix: anonymous profiles no longer hide the sign-in button — with ALLOW_ANONYMOUS=true, the backend returns a valid anonymous profile that was being treated as a logged-in user, permanently hiding the Sign In button. Fixed by filtering user_id.starts_with("anon-") from the auth-dropdown condition.

  • feat: previously-joined meetings on home page — new "Previously Joined" section showing the user's last 5 admitted meetings (owned or not) ordered by most-recent admission. Backed by a new GET /api/v1/meetings/joined?limit=N endpoint with an INNER JOIN meeting_participants query. State pills reuse the existing state-active / state-idle / state-ended vocabulary; a gold owner badge marks owned rows. Expand/collapse state persists in localStorage per section across reloads.

Other

  • feat: redesigned pre-join screen — matches the visual language of the rest of the app; settings toggles extracted into a PreJoinSettingsCard component for reuse.

  • fix: speaker highlight glow resets correctly after speaking stops — tile border now returns to neutral after speaking stops (previously stayed at the glow color); host tile retains speaking glow during screen-sharing. Glow toggle visuals updated to iOS green (ON) / gray (OFF).

  • fix: console log upload CORS header and gzip compression — adds X-Chunk-Seq to the CORS allow_headers list (was rejecting upload POSTs after a successful OPTIONS preflight); console log chunks are now gzip-compressed at write time (~10:1 ratio on NDJSON text, ~10× less disk I/O per chunk).

jboyd01 and others added 30 commits April 20, 2026 16:13
- Replace unbounded HashSet with bounded VecDeque (max 16 entries)
  to prevent unbounded growth during long-lived sessions with many
  re-elections. Oldest entries are evicted when the cap is reached.
- Remove ephemeral dev-session artifacts from settings.local.json
  (hard-coded pod name, overly broad Read permission).
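
The HashSet-to-VecDeque change in the first item above amounts to a capped, insertion-ordered set. A minimal sketch, assuming the entries are numeric IDs (the element type is not stated in the commit message) and using the hypothetical name `BoundedHistory`:

```rust
use std::collections::VecDeque;

/// Cap from the commit message.
pub const HISTORY_CAP: usize = 16;

/// Insertion-ordered set that evicts its oldest entry at the cap.
#[derive(Debug, Default)]
pub struct BoundedHistory {
    entries: VecDeque<u64>,
}

impl BoundedHistory {
    /// Adds `id` unless already present, evicting the oldest entry
    /// when the history is full.
    pub fn insert(&mut self, id: u64) {
        if self.entries.contains(&id) {
            return;
        }
        if self.entries.len() == HISTORY_CAP {
            self.entries.pop_front();
        }
        self.entries.push_back(id);
    }

    /// Linear scan; cheap at 16 entries.
    pub fn contains(&self, id: u64) -> bool {
        self.entries.contains(&id)
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    pub fn is_empty(&self) -> bool {
        self.entries.is_empty()
    }
}
```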
Error logging on upload failure uses the unwrapped console.error
(originals.error) which bypasses the collector, preventing a
feedback loop during prolonged backend outages.
Prevents potential underflow if websocket_drop_count() ever resets.
The counter is currently a static AtomicU64 (strictly monotonic), but
saturating_sub is defensive and costs nothing.
Explains why 0.01 is used instead of f64::EPSILON — the PID
integrator accumulates many small f64 additions, so rounding
error can easily exceed machine epsilon.
The set is built across 3 phases separated by 250+ lines of tile
rendering code. A future refactor could desync the pinned-peer
insertion from the dedup check. Comments now anchor the contract.
macOS Finder uses "natural sort" (localizedStandardCompare) which
treats digit runs inside hex strings as numbers, breaking the
chronological ordering of UUIDv7-based filenames. Replace the hex
UUID suffix with a zero-padded 5-digit chunk sequence counter
(00001, 00002, ...) that sorts correctly in every file manager.

Client sends X-Chunk-Seq header with each upload. Server zero-pads
to 5 digits for the filename. sendBeacon fallback (which can't set
headers) retains UUIDv7 suffix — these sort after numbered chunks.
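
The zero-padded suffix described above is a one-line format call. The function name `chunk_suffix` and the `u32` counter type are assumptions for illustration; how the server parses the X-Chunk-Seq header is not shown.

```rust
/// Renders a chunk sequence number as a fixed-width, 5-digit suffix so
/// lexicographic and natural sort orders agree (up to 99999 chunks).
pub fn chunk_suffix(seq: u32) -> String {
    format!("{seq:05}")
}
```

With fixed-width digits, a plain string sort and Finder's natural sort both yield chronological order.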
…ecovery

fix: PID integrator stuck at maximum with no recovery
Resolves conflict in dioxus-ui/src/components/waiting_room.rs
VideoCallClientOptions: keep display_name.clone() from this branch
and add is_guest from PR-staging's guest-support work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…am-c-ux-polish

fix: tile layout bounce + GUID display name (Workstream C)
…ressure-aq-activation

fix: enable adaptive quality for WebSocket-only users (OSS security-union#859)
Resolves conflict in videocall-client/src/diagnostics/encoder_bitrate_controller.rs
combining PR security-union#343's AQ_STATUS/AQ_BITRATE_CHANGE diagnostic logging with
PR security-union#337's PID-stuck watchdog + per-tier-change pid.reset() calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…log-upload-collision

fix: console log upload collision and retry storm
…aves-meeting

When host ends meeting, meeting is closed for everyone
Resolves architectural conflict in dioxus-ui/src/components/attendants.rs:
end_on_host_leave and allow_guests toggles from PR-staging ported into
the extracted PreJoinSettingsCard component (PR security-union#340's design) rather
than left inline.

Also reverts dioxus-ui/scripts/config.js to match PR-staging - dev-only
OAuth/WebTransport config changes are out of scope for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Design join page in same view as other pages
…on-signal-user-id-routing

fix: match CONGESTION signals against session_id history to survive re-election
…0-selective-decode

fix: cap large-meeting decode load to active layout
jboyd01 and others added 28 commits May 5, 2026 08:42
Resolve screen_encoder and health_reporter conflicts:
- Keep security-union#464's run_screen_encoding structure with start_with_stream
- Apply encoder error counter statics/getters and increments inside run_screen_encoding
- Keep security-union#513's HealthReporter::shutdown field/method/Weak ref pattern
- Rename frames_emitted -> frames_submitted_ok (Tony review feedback)
- Extract shared classify_encode_error helper with unit tests
…-auto-recovery

feat(client): encoder auto-recovery on closed-codec / VPX memory errors
…-error-metrics — resolve encoder counter conflicts
…e-token-based-styling-and-replace-hardcoded-values

Introduce token based styling and replace hardcoded values
…-error-metrics

feat(client): encoder error counters and frames-emitted in health packet
…ion-lost-reason-v3 — add encoder counter fields alongside connection-loss fields
…ion-lost-reason-v3

feat(client): distinguish handshake failures from session drops in connection loss
[stub 784] Cropping lost when users join
…ncoder-aq-metrics

fix(client): restore camera encoder shared AQ metrics
Adds a per-connection counter of consecutive implausible-RTT discards
(rejections from the `RTT_SANITY_MAX_MS` plausibility filter in
`handle_rtt_response`). The 1Hz `check_rtt_degradation` watchdog now
reads this counter on the active connection and triggers re-election
when the streak exceeds `REELECTION_IMPLAUSIBLE_DISCARDS_THRESHOLD` (10).
A single plausible measurement resets the streak.

This is defense-in-depth alongside PR-A: without it, a broken time-base
(client/server clock drift, NTP slew, virtualized-clock stretch) silently
drains the existing RTT detector of samples and leaves the user stuck on
a dead connection. Discussion security-union#539 documents one such incident: 255
implausible discards over 6 minutes on JRG_dirs (2026-05-05) where the
server-side clock ticked ~2.6x the client performance.now() rate.

Threshold of 10 means ~10 seconds of sustained brokenness at the 1Hz
post-election probe rate before re-election fires - long enough to ride
out a one-shot late ACK or NTP slew but short enough that the user does
not perceive the connection as dead. With only one server configured the
trigger is suppressed (re-election would just reproduce the same
brokenness) and the streak is reset so we do not log on every tick.
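
The streak logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in `connection_manager.rs`: the `DiscardStreak` type, the `WatchdogAction` enum, and the method names are hypothetical, and the ordering of the guards is an assumption. Only the constant name and its value of 10 come from the commit message.

```rust
/// Value taken from the commit message; the real constant lives elsewhere.
const REELECTION_IMPLAUSIBLE_DISCARDS_THRESHOLD: u32 = 10;

/// Hypothetical outcome of one 1 Hz watchdog tick.
#[derive(Debug, PartialEq, Eq)]
enum WatchdogAction {
    None,
    StartReelection,
    SuppressedSingleServer,
}

/// Hypothetical per-connection counter of consecutive implausible-RTT discards.
#[derive(Default)]
struct DiscardStreak {
    consecutive: u32,
}

impl DiscardStreak {
    /// Called for every RTT response: a plausible sample resets the streak,
    /// an implausible (discarded) one extends it.
    fn record_sample(&mut self, plausible: bool) {
        if plausible {
            self.consecutive = 0;
        } else {
            self.consecutive = self.consecutive.saturating_add(1);
        }
    }

    /// Called by the watchdog tick. The trigger fires only when the streak
    /// EXCEEDS the threshold, so 10 discards do nothing and 11 trigger.
    /// What happens to the streak after a trigger is not modelled here.
    fn check(&mut self, total_servers: usize, reelection_in_progress: bool) -> WatchdogAction {
        if reelection_in_progress || self.consecutive <= REELECTION_IMPLAUSIBLE_DISCARDS_THRESHOLD {
            return WatchdogAction::None;
        }
        if total_servers <= 1 {
            // Re-election would land on the same server; reset so the
            // suppression is not reported on every tick.
            self.consecutive = 0;
            return WatchdogAction::SuppressedSingleServer;
        }
        WatchdogAction::StartReelection
    }
}
```

Keeping the decision in a small state type like this makes the 10-versus-11 boundary easy to pin down in a host-side unit test.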

Verification: 8 new unit tests under
`connection_manager::tests::implausible_*` and `*_implausible_*`,
including 11-discard trigger, 10-discard non-trigger (boundary), 1+1+1
intermittent reset, single-server skip, and re-election-in-progress
guard. All pass on the previous worktree HEAD; the lib test target is
broken on PR-staging tip due to the unrelated security-union#526 issue. `cargo check`
plus `cargo clippy --target wasm32-unknown-unknown -- -D warnings` are
clean (default plus `--no-default-features`).

Refs: discussion security-union#539
…ailable

When the RTT-degradation watchdog fires but only one server is configured,
the existing rebase path silently adapts the baseline to the degraded RTT
and stays on the slow connection forever — even if conditions later improve.
Tony Estrada and Anhelina hit this in meeting_sync on 2026-05-05 (discussion
labs-projects/videocall#539): both rebased to >1s RTT and could only recover
by manually rejoining.

Adds a 30s re-election retry timer after each rebase (capped at 3 attempts,
~90s total budget). When the timer fires the manager re-evaluates whether
the candidate set has grown — e.g. dioxus-ui called update_server_urls
after refreshing the room token — and invokes start_reelection if so.
If the URL list is still single-server it schedules another retry until
the budget is exhausted. The counter is reset on reset_and_start_election
and on a successful complete_election so each new session/election starts
with a fresh retry budget.

The retry honours the user's transport preference. The dioxus-ui already
exposes a TransportPreference context (Auto / WebTransportOnly /
WebSocketOnly, persisted in localStorage / sessionStorage and resolved at
URL-build time via resolve_transport_config). Manual WebTransportOnly /
WebSocketOnly selections produce a single-candidate URL list by design,
so the retry must not override that choice. To plumb this through the
connection manager (which only sees the post-filter URL list), this PR adds
allow_post_rebase_retry: bool to VideoCallClientOptions and
ConnectionManagerOptions. The dioxus-ui passes
transport_pref == TransportPreference::Auto for the main meeting client
and false for short-lived observer clients (waiting room, guest join,
meeting page).

The async timer body is gated to target_arch = "wasm32" because
gloo_timers and wasm_bindgen_futures::spawn_local panic outside the
browser. The retry decision logic is factored into a pure
decide_post_rebase_retry_action so the policy is host-test-safe.
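
A pure policy function of this kind might look like the sketch below. The function name and the three outcomes (Skip / FireElection / Reschedule) come from this commit message; the signature, the `RetryAction` enum name, the `POST_REBASE_MAX_RETRIES` constant name, and the exact order of checks are assumptions made for illustration.

```rust
/// Hypothetical name; the cap of 3 attempts is from the commit message.
const POST_REBASE_MAX_RETRIES: u32 = 3;

#[derive(Debug, PartialEq, Eq)]
enum RetryAction {
    /// Do nothing: retry not allowed, or budget exhausted.
    Skip,
    /// The candidate set has grown; start a re-election.
    FireElection,
    /// Still single-server; arm another timer.
    Reschedule,
}

/// Decide what to do when the post-rebase retry timer fires.
/// `attempts_used` counts timers that have already fired, including this one.
fn decide_post_rebase_retry_action(
    allow_post_rebase_retry: bool,
    attempts_used: u32,
    total_server_count: usize,
) -> RetryAction {
    if !allow_post_rebase_retry {
        // Manual transport preference or observer client: never retry.
        return RetryAction::Skip;
    }
    if total_server_count > 1 {
        return RetryAction::FireElection;
    }
    if attempts_used < POST_REBASE_MAX_RETRIES {
        RetryAction::Reschedule
    } else {
        RetryAction::Skip
    }
}
```

Because the function takes plain values and touches no timers, it can run under `cargo test` on the host while the `gloo_timers` scheduling stays behind the wasm32 gate.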

Tests:
- post_rebase_retry_decision_* — pure policy assertions for every branch
  (Skip / FireElection / Reschedule).
- post_rebase_retry_increments_counter_when_allowed — rebase under Auto
  preference advances the retry counter.
- post_rebase_retry_not_scheduled_when_user_pref_forbids — manual
  transport preference suppresses the retry.
- post_rebase_retry_caps_at_max_attempts — budget cap is honoured.
- post_rebase_retry_counter_resets_on_reset_and_start_election — counter
  reset path is plumbed through the public API.

All 54 tests in connection::connection_manager::tests pass on host (run
locally; the peer_decode_manager test compilation blocker tracked as security-union#526
is independent of this change).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…eak doctests

Addresses two blockers from the @jay-boyd / @Antonio-Estrada review of security-union#542:

1. update_server_urls now propagates from VideoCallClient -> ConnectionController
   -> ConnectionManager so the post-rebase retry's total_server_count() sees
   refreshed candidate URLs. Without this, dioxus-ui calling update_server_urls
   after a token refresh left the retry timer reading stale single-server URLs
   forever.

2. Updated lib.rs doctests to include allow_post_rebase_retry and is_guest so
   cargo test --doc -p videocall-client compiles. (The is_guest field comes
   from a separate, earlier change, so the doctest was broken on two counts.)

Tests:
- update_server_urls_propagates_into_total_server_count locks in the propagation
- post_rebase_retry_decision_fires_after_url_propagation exercises the end-to-end
  invariant (single-server -> multi-server transition flips the retry decision)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ble-rtt-reelection-trigger

fix(client): treat sustained implausible RTT as a re-election trigger
…failure

When a re-election starts and ALL candidates fail before producing valid
RTT samples (the JRG_dirs Tony S1 incident on 2026-05-05 15:05:47 UTC,
discussion security-union#539), check whether the old active connection has had any
inbound traffic within the last 5 s. If yes, the candidates' failure is
treated as a transient relay-side outage and the old connection is
preserved instead of being torn down. A 30 s re-election retry is
scheduled to give the relay time to recover.

Tony's S1 in JRG_dirs was on a healthy WebTransport connection (69 ms
baseline RTT) when his RTT spiked to 1477 ms over 5 consecutive samples,
triggering automatic re-election. Both new candidates (wt_0_g1 and
ws_0_g1) failed handshake within 14 ms of each other (a brief
relay-side outage). The election deadline lapsed 4 s later, the system
declared "Election failed: No valid connections with RTT measurements
found", and the user was force-disconnected. He was absent for 16
minutes before manually rejoining.

The old wt_0 connection was still alive when the candidates failed
(still receiving media + heartbeat). With this fix, the old connection
stays put.

Implementation:
- Track last_inbound_at_ms per connection. Updated by the inbound media
  callback on every inbound packet (media, RTT echo, SESSION_ASSIGNED,
  heartbeat ACK).
- New helper try_preserve_old_connection_on_candidate_failure runs in
  the Err branch of complete_election. Returns true if preservation
  applies; the caller skips the existing Failed-state emission.
- Preservation conditions: re-election in progress, the
  reelection_preserved_once guard is false, the old active slot is
  populated, and the old connection's inbound freshness is within
  REELECTION_PRESERVATION_FRESHNESS_MS (5 s).
- Preservation actions: restore the old connection to
  self.connections, restore its RTT measurement, set election state to
  Elected on the old id, rebase the degradation baseline, close failed
  candidates, schedule a REELECTION_PRESERVATION_RETRY_MS (30 s) retry,
  and emit Connected (NOT Failed).
- Anti-loop guard: reelection_preserved_once is set on preservation,
  cleared only when a re-election cycle reaches a clean conclusion
  (Elected or aborted-on-no-improvement). If the 30 s retry's election
  ALSO ends in total candidate failure, the guard forces the failure path
  to fall through to disconnect, guaranteeing that a genuinely dead
  connection cannot be pinned indefinitely.
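
The preservation conditions listed above reduce to a small predicate, sketched here under stated assumptions. The `PreservationInputs` struct and the `should_preserve_old_connection` name are hypothetical (the real helper is `try_preserve_old_connection_on_candidate_failure` and also performs the restore actions). Timestamps are modelled as integer milliseconds, and treating exactly 5000 ms as still fresh is an assumption; the commit message only pins the 4.99 s and 5.01 s cases.

```rust
/// Value from the commit message (5 s).
const REELECTION_PRESERVATION_FRESHNESS_MS: u64 = 5_000;

/// Hypothetical snapshot of the state the check needs.
#[derive(Clone, Copy)]
struct PreservationInputs {
    reelection_in_progress: bool,
    reelection_preserved_once: bool,
    old_active_present: bool,
    /// None means no inbound packet was ever recorded for the old connection.
    last_inbound_at_ms: Option<u64>,
    now_ms: u64,
}

/// Returns true when the old connection should be kept instead of emitting Failed.
fn should_preserve_old_connection(i: &PreservationInputs) -> bool {
    if !i.reelection_in_progress || i.reelection_preserved_once || !i.old_active_present {
        return false;
    }
    match i.last_inbound_at_ms {
        // Fresh enough: the old connection is still carrying traffic.
        Some(t) => i.now_ms.saturating_sub(t) <= REELECTION_PRESERVATION_FRESHNESS_MS,
        // Never saw inbound traffic: fall through to the failure path.
        None => false,
    }
}
```

The anti-loop guard appears as a plain boolean input, so a second consecutive total failure always falls through regardless of freshness.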

Tests:
- Fall-through when the old connection is silent past the 5 s window
- Fall-through when no inbound traffic was ever recorded
- Fall-through when not in a re-election (initial-election failure path)
- Fall-through when the preservation guard is already set
- 4.99 s vs 5.01 s freshness boundary checks
- start_reelection retains only the old active's freshness entry
- reset_and_start_election and disconnect both clear preservation
  state cleanly
- Constant-value documentation tests so future tuning is intentional

Refs: security-union#539 (JRG_dirs analysis), Tony Estrada S1 timeline 2026-05-05
15:05:47 UTC.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses the blocker from the @jay-boyd / @Antonio-Estrada review of security-union#544.

The pending preservation-retry flag was correctly cleared in
reset_and_start_election:410 but missed in three other paths where
re-election cleanly concludes:

  1. start_reelection — after the in-progress early-return guard
  2. complete_election abort-on-no-improvement
  3. complete_election Elected success branch

Without these clears, a 30s preservation-retry timer armed by a prior
candidate-failure event could fire spuriously on a just-elected healthy
connection. The reelection_in_progress guard inside start_reelection
only absorbs the race when the new cycle is still running at retry time;
once the new cycle completes, the stale timer fires unimpeded and
triggers unnecessary churn.

Pattern matches the existing cleanup in reset_and_start_election. No
new logic, just hoisting the cleanup invariant to the missing sites.

Tests:
  - start_reelection_clears_pending_preservation_retry
  - complete_election_elected_branch_clears_pending_preservation_retry
  - complete_election_abort_no_improvement_branch_clears_pending_preservation_retry

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resolves the additive conflicts from security-union#541 in:
- videocall-client/src/adaptive_quality_constants.rs
- videocall-client/src/connection/connection_manager.rs

Both files were additive in adjacent regions; resolution = keep both sides.
The PR-staging branch's REELECTION_IMPLAUSIBLE_DISCARDS_THRESHOLD constant
and the corresponding import are kept alongside this branch's POST_REBASE_*
constants and update_server_urls method.
…base-reelection-retry

feat(client): post-rebase re-election retry when candidates were unavailable
…ailure

Resolves conflicts introduced by security-union#542 [post-rebase re-election retry]
landing on PR-staging. Both PRs are additive in adjacent regions of
adaptive_quality_constants.rs and connection_manager.rs - struct fields,
constructors, and test module. Resolution is "keep both" throughout;
no semantic clash between security-union#542's post-rebase retry machinery and this
PR's old-connection-preservation machinery.

Verifications at the merge head:
- cargo fmt --all -- --check  [clean]
- cargo check --target wasm32-unknown-unknown -p videocall-client
  [default + --no-default-features, both clean]
- cargo clippy --target wasm32-unknown-unknown -p videocall-client
  -- -D warnings  [default + --no-default-features, both clean]
- cargo test --doc -p videocall-client  [7/7 passing]

Invariant greps post-merge:
- 5 reelection_retry_pending clear-sites [this PR]
- 22 consecutive_implausible_discards references [security-union#541, intact]
- 3 update_server_urls definitions across the 3-layer plumbing [security-union#542]
- All 5 expected test names from both PRs survived
…on-preserve-old-on-total-failure

fix(client): re-election preserves old connection on total candidate failure
…ignal-quality popups

Each client now stamps its active transport into the periodic
HeartbeatMetadata (new TransportType enum, field 5). Receivers track
the latest value per peer and forward it through the existing
peer_status DiagEvent as a peer_transport text metric.

Two UI surfaces consume it:

- The diagnostics popup's Per-Peer Summary renders a compact WT / WS
  / em-dash pill next to each peer's buffer/jitter line.
- The signal-quality popup (chart popover behind each peer's signal-
  bars icon) renders the same pill in its header next to the peer
  name.

Signal writes are gated on actual change so heartbeat ticks (~1 Hz
per peer) don't trigger UI re-renders. Backward compatible: peers on
older clients arrive as TRANSPORT_UNKNOWN and render as em-dash.
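
The change-gating described above could be sketched as follows. This is an illustration only: the `PeerTransports` tracker, its `observe` method, the `pill_label` helper, and the Rust variant names are hypothetical, and the real enum is generated from the protobuf `TransportType`. The WT / WS / em-dash labels follow the commit message.

```rust
use std::collections::HashMap;

/// Hypothetical Rust-side mirror of the heartbeat's transport enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TransportType {
    Unknown,
    WebTransport,
    WebSocket,
}

/// Text shown in the pill; older clients report Unknown and get an em-dash.
fn pill_label(t: TransportType) -> &'static str {
    match t {
        TransportType::WebTransport => "WT",
        TransportType::WebSocket => "WS",
        TransportType::Unknown => "\u{2014}",
    }
}

/// Tracks the latest transport seen per peer.
#[derive(Default)]
struct PeerTransports {
    latest: HashMap<String, TransportType>,
}

impl PeerTransports {
    /// Records a heartbeat value and returns true only if it differs from
    /// the stored one, so the caller writes the UI signal on change and
    /// skips the roughly 1 Hz repeats.
    fn observe(&mut self, peer: &str, t: TransportType) -> bool {
        match self.latest.get(peer) {
            Some(prev) if *prev == t => false,
            _ => {
                self.latest.insert(peer.to_string(), t);
                true
            }
        }
    }
}
```

A caller would write the signal only when `observe` returns true, which keeps steady-state heartbeats from re-rendering either popup.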

E2E coverage in diagnostics-peer-transport.spec.ts (3-browser, gated
on >= 2 remote peers) and signal-quality-peer-transport.spec.ts
(2-browser, opens popup via aria-label="Show signal quality").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ic-tweaks

feat(client,ui): per-peer transport (WT/WS) badge in diagnostics & signal popups
@testradav changed the title from "RnD sync — April 28 to May 1" to "RnD sync — April 28 to May 5" on May 6, 2026
@testradav
Collaborator Author

Pushed a new # May 5 section at the top of the body — covers 11 HCL-internal PRs that landed on PR-staging since the prior sync (encoder auto-recovery + observability, re-election preserve/retry hardening, token-based styling, crop persistence, ALLOW_ANONYMOUS default, per-peer WT/WS badge). Build verified locally on meeting-api + videocall-client + videocall-ui wasm targets at merge head 22476ab.

7 participants