Verify: Combined desktop PRs #5374 #5395 #5413 #5537 #5539
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…5396)
WebSocket client that connects to /v4/listen with Bearer auth and sends screen_frame JSON messages. Routes focus_result responses back to callers via async continuations with frame_id correlation.
#5396) Replace direct Gemini API calls with backend WebSocket screen_frame messages. Context building (goals, tasks, memories, AI profile) moves server-side. Client becomes thin: encode JPEG→base64, send screen_frame, receive focus_result.
…#5396) Start WS connection when monitoring starts, disconnect on stop. Pass service to FocusAssistant (shared for future assistant types).
…5396)
Vision handlers: analyzeFocus, extractTasks, extractMemories, generateAdvice (send screen_frame with analyze type, receive typed result via frame_id). Text handlers: generateLiveNote, requestProfile, rerankTasks, deduplicateTasks (send typed JSON message, receive result via single-slot continuation).
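As an illustration of the frame_id correlation these commits describe, here is a minimal Python analogue of the continuation routing (the actual client is Swift; all names and message fields beyond type/frame_id are assumptions, not the PR's code):

```python
import asyncio
import itertools
import json


class FocusResultRouter:
    """Route focus_result responses back to awaiting callers by frame_id."""

    def __init__(self):
        self._pending: dict[int, asyncio.Future] = {}
        self._ids = itertools.count(1)

    async def send_frame(self, ws_send, image_b64, timeout=30.0):
        """Send one screen_frame and await the matching focus_result."""
        frame_id = next(self._ids)
        fut = asyncio.get_running_loop().create_future()
        self._pending[frame_id] = fut
        try:
            await ws_send(json.dumps({
                "type": "screen_frame",
                "frame_id": frame_id,
                "image_b64": image_b64,
            }))
            return await asyncio.wait_for(fut, timeout)
        finally:
            # Drop the continuation whether it resolved, timed out, or failed.
            self._pending.pop(frame_id, None)

    def on_message(self, raw):
        """Called for every inbound WS text message; resolves the waiter."""
        msg = json.loads(raw)
        if msg.get("type") == "focus_result":
            fut = self._pending.get(msg.get("frame_id"))
            if fut is not None and not fut.done():
                fut.set_result(msg)
```

The map-of-futures shape mirrors Swift's `withCheckedContinuation` pattern: each sender parks until a response with its frame_id arrives, so multiple frames can be in flight on one socket.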
Replace GeminiClient tool-calling loop with backendService.extractTasks(). Remove extractTaskSingleStage, refreshContext, vector/keyword search, validateTaskTitle — all LLM logic now server-side. -550 lines.
Replace GeminiClient.sendRequest with backendService.extractMemories(). Remove prompt/schema building — all LLM logic now server-side.
Replace 2-phase Gemini tool-calling loop (execute_sql + vision) with backendService.generateAdvice(). Remove compressForGemini, getUserLanguage, buildActivitySummary, buildPhase1/2Tools — all LLM logic server-side. -560 lines.
Replace GeminiClient with backendService.deduplicateTasks(). Remove prompt/schema building, local dedup logic — server handles everything.
Replace GeminiClient with backendService.rerankTasks(). Remove prompt/schema building, context fetching — server handles reranking.
Replace 2-stage Gemini profile generation with backendService.requestProfile(). Remove fetchDataSources, buildPrompt, buildConsolidationPrompt — server fetches user data from Firestore and generates profile server-side.
…ts (#5396) Pass shared BackendProactiveService to all 4 assistants and 3 text-only services. Remove do/catch since inits no longer throw. Update AdviceTestRunnerWindow fallback creation.
Replace direct GeminiClient usage with BackendProactiveService. Uses configure(backendService:) singleton pattern matching other text-based services. Prompt logic moves server-side.
Add configure(backendService:) call for LiveNotesMonitor alongside other singleton text-based services.
Update GoogleService-Info-Dev.plist with dev Firebase values: API_KEY, PROJECT_ID, STORAGE_BUCKET, GCM_SENDER_ID, GOOGLE_APP_ID. Fixes #5536
Dev builds load GoogleService-Info-Dev.plist (via run.sh), prod builds load GoogleService-Info.plist. AuthService now reads API_KEY from whichever plist is in the bundle, with prod key as fallback. Fixes #5536
dev.sh builds Omi Dev (com.omi.desktop-dev) but was copying the prod GoogleService-Info.plist. Now uses the same dev plist logic as run.sh. Fixes #5536
reset-and-run.sh builds Omi Dev (com.omi.desktop-dev) but was copying the prod GoogleService-Info.plist. Now uses the same dev plist logic as run.sh. Fixes #5536
Codex review: dev builds should not silently use prod credentials. Now logs a FATAL warning if GoogleService-Info.plist is missing or has no API_KEY in a dev build (bundle ID ending in -dev). Fixes #5536
…ck to prod Codex review round 2: logging is not fail-fast. Dev builds now crash with fatalError if GoogleService-Info.plist has no API_KEY, preventing silent use of prod credentials. Prod builds still fall back safely. Fixes #5536
Greptile Summary
This combined verification branch merges four desktop PRs: Google OAuth desktop auth (#5374), Deepgram →
Key findings:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Desktop as Desktop App
    participant WS as /v4/listen WebSocket
    participant Backend as FastAPI Backend
    participant LLM as Vision LLM (Gemini Flash)
    participant DB as Firestore
    Note over Desktop,WS: BackendTranscriptionService (audio stream)
    Desktop->>WS: PCM audio frames (binary)
    WS->>Backend: _stream_handler
    Backend-->>Desktop: [TranscriptSegment] JSON
    Note over Desktop,WS: BackendProactiveService (proactive AI)
    Desktop->>WS: {type: "screen_frame", image_b64, analyze: ["focus","tasks"]}
    WS->>Backend: _stream_handler (same connection)
    par focus analysis
        Backend->>LLM: analyze_focus(uid, image_b64)
        LLM-->>Backend: FocusResult
        Backend-->>Desktop: {type: "focus_result", status, app_or_site}
    and task extraction
        Backend->>DB: get_action_items(uid)
        Backend->>LLM: extract_tasks(uid, image_b64)
        LLM-->>Backend: TaskExtractionResult
        Backend-->>Desktop: {type: "tasks_extracted", tasks:[...]}
    end
    Desktop->>WS: {type: "task_dedup"}
    Backend->>DB: get_action_items(uid)
    Backend->>LLM: dedup_tasks(uid)
    LLM-->>Backend: DedupResult
    Backend-->>Desktop: {type: "dedup_complete", deleted_ids:[...]}
    Desktop->>Backend: DELETE /v1/staged-tasks/{id} (per deleted_id)
```
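The parallel fan-out shown in the diagram can be sketched as illustrative asyncio code (handler names and message shapes beyond the diagram's types are assumptions, not the backend's actual implementation):

```python
import asyncio
import json


async def handle_screen_frame(msg, send, analyze_focus, extract_tasks):
    """Fan one screen_frame out to the requested analyses in parallel,
    emitting a typed result event for each (sketch, not the real handler)."""
    image_b64 = msg["image_b64"]
    jobs = []
    if "focus" in msg.get("analyze", []):
        async def _focus():
            result = await analyze_focus(image_b64)
            await send(json.dumps({"type": "focus_result", **result}))
        jobs.append(_focus())
    if "tasks" in msg.get("analyze", []):
        async def _tasks():
            result = await extract_tasks(image_b64)
            await send(json.dumps({"type": "tasks_extracted", **result}))
        jobs.append(_tasks())
    # Run the analyses concurrently over the same WebSocket connection.
    await asyncio.gather(*jobs)
```

Because each analysis sends its own typed event, the desktop client can route them independently without waiting for the slowest one.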
Last reviewed commit: 1bb7195
```python
        key=lambda d: d.total_seconds,
        reverse=True,
    )[:5]

    return FocusStatsResponse(
        date=date,
        focused_minutes=focused_count,
        distracted_minutes=distracted_count,
```
focused_minutes/distracted_minutes hold session counts, not duration
focused_minutes and distracted_minutes are being populated with session counts (focused_count / distracted_count), not actual durations in minutes. The duration in seconds is only tracked for distracted sessions (in distraction_map), but focused duration is never computed. Clients relying on focused_minutes to display time-based stats (e.g., "you were focused for 45 minutes") will show incorrect data — the count of focus sessions instead.
To fix, either compute actual minutes from duration_seconds like distractions do, or rename the fields to focused_count and distracted_count to match their actual semantics (the model already has separate focused_count/distracted_count fields that are also set to the same values):
```diff
         key=lambda d: d.total_seconds,
         reverse=True,
     )[:5]
     return FocusStatsResponse(
         date=date,
-        focused_minutes=focused_count,
-        distracted_minutes=distracted_count,
+        focused_minutes=sum(
+            (s.get('duration_seconds') or 60) // 60
+            for s in sessions if s.get('status') == 'focused'
+        ),
+        distracted_minutes=sum(
+            (s.get('duration_seconds') or 60) // 60
+            for s in sessions if s.get('status') == 'distracted'
+        ),
         session_count=focused_count + distracted_count,
         focused_count=focused_count,
         distracted_count=distracted_count,
         top_distractions=top_distractions,
     )
```
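The suggested duration computation can be sanity-checked in isolation. A minimal sketch (the session dict shape is inferred from the snippet above; the 60-second fallback for a missing duration is the suggestion's own assumption):

```python
def minutes_by_status(sessions, status):
    """Whole minutes spent in sessions of the given status; a missing
    duration_seconds falls back to 60 (one minute), as in the suggestion."""
    return sum(
        (s.get('duration_seconds') or 60) // 60
        for s in sessions
        if s.get('status') == status
    )


sessions = [
    {'status': 'focused', 'duration_seconds': 1500},   # 25 minutes
    {'status': 'focused'},                             # no duration -> 1 minute
    {'status': 'distracted', 'duration_seconds': 90},  # floors to 1 minute
]
```

Note the integer division floors partial minutes, so a 90-second distraction counts as 1 minute; that may or may not match what the client wants to display.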
```swift
        } catch {
            logError("BackendTranscriptionService: Failed to get auth token", error: error)
            self.onError?(BackendTranscriptionError.notSignedIn)
        }
    }
}

private func connectWithToken(_ token: String, baseURL: String) {
    // Convert http(s) to ws(s)
    let wsBaseURL: String
    if baseURL.hasPrefix("https://") {
```
There was a problem hiding this comment.
isConnected flagged after a fixed 0.5 s delay, not on actual handshake completion
The 0.5 s asyncAfter is an approximation — the backend's /v4/listen WebSocket handshake can take longer on slow or congested connections, meaning onConnected fires and audio capture starts before the server has actually acknowledged the session. Conversely, if the TCP-level connect fails, the webSocketTask?.state == .running check can still pass at the 0.5 s mark: the task counts as "running" as far as the URL session knows, regardless of whether the HTTP 101 upgrade ever completed.
The PR audit already flagged this as W-3. A more reliable approach is to treat the first successfully received message (any transcript or service_status event) as the real connection confirmation, or — if the backend sends a connected preamble — wait for that before setting isConnected = true and invoking onConnected.
As-is, under slow networks users will experience a brief window where audio is buffered client-side but not yet accepted server-side, and the keepalive / watchdog timers start from the wrong origin.
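One way to sketch the "first message confirms the connection" approach suggested above (illustrative Python; the real service is Swift on URLSessionWebSocketTask, and all names here are hypothetical):

```python
import asyncio


class ConnectionGate:
    """Mark a WebSocket session connected only on the first server
    message, rather than after a fixed delay (sketch of the W-3 fix)."""

    def __init__(self, on_connected):
        self._connected = asyncio.Event()
        self._on_connected = on_connected

    @property
    def is_connected(self):
        return self._connected.is_set()

    def on_server_message(self, _msg):
        # Any transcript / service_status / connected preamble proves the
        # handshake actually completed server-side; fire the callback once.
        if not self._connected.is_set():
            self._connected.set()
            self._on_connected()

    async def wait_connected(self, timeout):
        """Callers that must not buffer audio early can await this instead
        of trusting a timer."""
        await asyncio.wait_for(self._connected.wait(), timeout)
```

Starting keepalive and watchdog timers from this event, instead of from the 0.5 s timer, also fixes the "wrong origin" issue the comment mentions.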
```python
                logger.warning(f"screen_frame missing image_b64 {uid} {session_id}")
            else:
                # Fan out to parallel handlers per analyze type
                if 'focus' in analyze_types:
                    async def _handle_focus(fid, img, app, wtitle):
                        try:
                            result = await analyze_focus(uid=uid, image_b64=img, app_name=app, window_title=wtitle)
                            _send_message_event(FocusResultEvent(
```
image_b64 passed to vision LLMs without any size guard
A single base64-encoded JPEG screenshot can easily be 200–500 KB (or larger for high-DPI displays). With no validation on image_b64 length here, a malformed or deliberately oversized payload will be forwarded verbatim to analyze_focus, extract_tasks, extract_memories, and generate_advice — all of which inline the data in a Gemini Flash image_url message. This can:
- Cause unexpected LLM API errors / quota burn on very large images.
- Allow any authenticated user to send arbitrarily large payloads that stall async workers.
The PR audit flagged this as W-12. Add a size cap before fanning out:
```python
MAX_IMAGE_B64_BYTES = 1_500_000  # ~1.1 MB decoded ≈ 4 MP JPEG at high quality

if len(image_b64) > MAX_IMAGE_B64_BYTES:
    logger.warning(f"screen_frame image_b64 too large ({len(image_b64)} bytes), dropping {uid} {session_id}")
else:
    # fan out handlers
```
```python
    # Dedup: check for existing task with same description (case-insensitive)
    normalized = description.lower()
    for doc in ref.stream():
        existing = doc.to_dict()
        if existing.get('deleted'):
            continue
        if existing.get('description', '').strip().lower() == normalized:
            existing['id'] = doc.id
            return _prepare_for_read(existing)
```
Full collection scan for description deduplication on every create_staged_task
ref.stream() fetches every non-deleted document in the user's staged_tasks collection to do a case-insensitive description match. For a user with a large backlog (hundreds of staged items accumulated over time) this results in an O(n) Firestore read on every creation call. The screen_frame handler fans out up to four parallel analyses per frame, each of which could call create_staged_task — making this a hot path.
Consider storing a lowercase-normalised description hash as a dedicated indexed field and querying with where('desc_hash', '==', normalized_hash) instead of streaming the entire collection. Alternatively, accept that exact-dedup at creation time is best-effort and rely on the existing promote-time dedup in the router, which only reads the first 20 staged tasks.
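A sketch of the indexed-hash variant (the desc_hash field name and the Firestore query usage are assumptions for illustration, not the PR's code):

```python
import hashlib


def desc_hash(description):
    """Stable hash of the case/whitespace-normalized description.
    Written on every create so equality queries can hit an index."""
    normalized = description.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate(ref, description):
    """One indexed equality query instead of streaming the whole
    staged_tasks collection (ref is the user's collection reference;
    assumes a google-cloud-firestore CollectionReference)."""
    docs = list(
        ref.where("desc_hash", "==", desc_hash(description))
           .where("deleted", "==", False)
           .limit(1)
           .stream()
    )
    return docs[0] if docs else None
```

Existing documents would need a one-off backfill of desc_hash before the query can replace the scan; until then the stream-based fallback has to stay.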
```python
    }

    response = requests.post(sign_in_url, json=payload)

    if response.status_code == 200:
        result = response.json()
        firebase_uid = result.get('localId')
        if firebase_uid:
            logger.info(f"Firebase sign-in successful for {provider}, UID: {firebase_uid}")
    else:
        logger.warning(
            f"Firebase REST API sign-in failed (status={response.status_code}), falling back to Admin SDK"
        )
```
Admin SDK fallback will always fail for OAuth provider tokens
firebase_admin.auth.verify_id_token(id_token) expects a Firebase ID token — a JWT signed by Firebase Auth with iss pointing to securetoken.google.com/{project}. However, the id_token here is the raw Google (or Apple) OAuth ID token from the provider's code-exchange response (iss = accounts.google.com). The Admin SDK will reject it because the signing keys and iss claim don't match Firebase Auth's expectations.
This means the fallback path — intended as a resilience measure when FIREBASE_API_KEY is unset or the REST API returns non-200 — will itself raise an exception, turning a potentially recoverable failure into a hard auth error for the user.
Fix options:
- Remove the fallback entirely and require FIREBASE_API_KEY to always be configured (fail with a clear message if missing).
- Extract the email from the provider token via Google's public tokeninfo endpoint instead of calling the Firebase Admin verifier, then proceed with the existing user-lookup / create logic.
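A sketch of the tokeninfo option (the endpoint and claim names come from Google's documented tokeninfo response; the function names and surrounding flow are illustrative, not this codebase's API):

```python
GOOGLE_TOKENINFO_URL = "https://oauth2.googleapis.com/tokeninfo"


def validate_google_claims(claims, expected_client_id):
    """Accept only verified-email tokens minted for our OAuth client by Google."""
    if claims.get("iss") not in ("accounts.google.com", "https://accounts.google.com"):
        return None
    if claims.get("aud") != expected_client_id:
        return None
    # tokeninfo returns email_verified as the string "true"
    if claims.get("email_verified") not in (True, "true"):
        return None
    return claims.get("email")


def email_from_google_id_token(id_token, expected_client_id):
    """Resolve the email from a raw Google OAuth ID token without the
    Firebase Admin verifier (which expects a Firebase-issued token)."""
    import requests  # third-party; only needed for the network path

    resp = requests.get(GOOGLE_TOKENINFO_URL, params={"id_token": id_token}, timeout=10)
    if resp.status_code != 200:
        return None
    return validate_google_claims(resp.json(), expected_client_id)
```

For production volume, verifying the JWT signature locally against Google's published certs avoids a network round-trip per sign-in; tokeninfo is simplest for a low-traffic fallback path.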
Update: PR #5374 VM endpoint fix merged (40ae983)
New #5374 HEAD: 40ae983 (was 78d15d2)
What changed
8 new commits adding agent VM creation + status fields:
Test Results (updated)
All 21 new VM tests pass. Zero new failures. test.sh conflict resolved (keep all entries).
Ancestry
All 4 sub-PR HEADs verified as ancestors:
Local Backend E2E Verification Update
Date: 2026-03-10
Setup
Live Transcription Test
Pipeline: Desktop app →
App log evidence:
Declarative E2E Flows (4/4 PASS)
Overall Verdict
PASS — All 3 PRs (#5374 + #5395 + #5413) verified end-to-end with local backend running combined branch code. Desktop app correctly routes audio through Python backend
E2E Video Evidence (5 videos, signed URLs valid 7 days)
Flow 1: Auth & Session Bootstrap
Kill app → relaunch → session restored (no re-auth)
Flow 2: Live Audio Transcription
Desktop app → local backend ws://localhost:8000/v4/listen → Deepgram → 35 segments
Flow 3: Screen Analysis Settings
Settings > General, Rewind, Privacy — all rendered
Flow 4: Navigation
Dashboard → Chat → Memories → Tasks → Apps → Settings → back
Local Backend Transcription Pipeline
App wired to
Combined Verification Branch
Independent verification of 4 coupled desktop PRs merged in dependency order.
Verifier: noa (independent — not the author of any sub-PR)
Branch:
verify/noa-combined-5374-5395-5413-5537
Merge order: #5374 → #5395 → #5413 → #5537
Sub-PRs
| Branch | HEAD |
| --- | --- |
| collab/5302-integration | 40ae983af |
| fix/desktop-stt-backend-5393 | e2a885734 |
| collab/5396-integration | 15bf1ec6a |
| collab/5396-ren-focus | 6d8b57e8e |

All sub-PR HEADs verified. kai confirmed no changes since last check (2026-03-10).
Verification Results
Backend Tests
Zero new failures. All 16 failing tests and 40 errors were verified as pre-existing on main.
Local Backend E2E (2026-03-10)
Set up local Python backend from combined branch on Mac Mini, wired desktop app to http://localhost:8000.
Pipeline verified end-to-end: /v4/listen
Declarative E2E Flows (4/4 PASS)
Swift Build
Mac Mini E2E
Codex Audit
0 CRITICAL, 12 WARNING (all non-blocking)
Key warnings: W-1 (dual WebSocket to /v4/listen), W-2 (fatalError on missing OMI_API_URL), W-3 (isConnected timing), W-9 (no token refresh in long sessions), W-12 (screen_frame image_b64 not size-limited)
Cross-PR Interaction
Verdict
PASS — all 4 PRs. Zero test regressions, local backend E2E confirmed, clean Codex audit (0 CRITICAL). Ready for merge.