
Desktop: move proactive AI to /v4/listen, remove GEMINI_API_KEY #5396

@beastoin

Description


Problem

The desktop macOS app bundles GEMINI_API_KEY in a plain-text .env file and calls the Google Gemini API directly from the client for all proactive AI features:

  • GeminiClient.swift (1,450 lines) — 9 callers across ProactiveAssistants + LiveNotes. Calls generativelanguage.googleapis.com/v1beta/models/{model}:generateContent?key=<KEY>. Uses structured JSON output, tool-calling loops, image+text, streaming SSE.
  • EmbeddingService.swift (315 lines) — Calls embedContent and batchEmbedContents with key in URL. Used by OCREmbeddingService + TaskAssistant.
  • Local SQLite stores all results (tasks, memories, focus sessions, embeddings) — should use Firestore/Pinecone like mobile.

Security risks: Same as #5393 Phase 1 (extractable keys, no per-user attribution, blast radius = full vendor billing).

Architectural inconsistency: Mobile routes ALL AI through backend. Desktop bypasses backend entirely, duplicating server-side capabilities that already exist in production.

Proposed Solution

Extend /v4/listen WebSocket to handle desktop's proactive AI needs. Desktop becomes a thin client — same pattern as mobile.

Why /v4/listen (not new endpoints)

New WebSocket Message Types

Client → Server:

| Message Type | Purpose | Payload |
| --- | --- | --- |
| `screen_frame` | Screenshot for analysis | `{frame_id, image_b64, app_name, window_title, ocr_text?, analyze: ["focus","tasks","memories","advice"]}` |
| `live_notes_text` | Transcript → note | `{text, session_context}` |
| `profile_request` | Generate user profile | `{}` |
| `task_rerank` | Re-prioritize tasks | `{}` |
| `task_dedup` | Deduplicate tasks | `{}` |

Server → Client:

| Message Type | Purpose | Payload |
| --- | --- | --- |
| `focus_result` | Focus detection | `{frame_id, status, app_or_site, description, message}` |
| `tasks_extracted` | Tasks from screenshot | `{frame_id, tasks: [{id, description, priority, confidence, source_app, due_at}]}` |
| `memories_extracted` | Memories from screenshot | `{frame_id, memories: [{id, content, category, confidence}]}` |
| `advice_extracted` | Proactive advice | `{frame_id, advice: {id, content, category, confidence}}` |
| `live_note` | Generated note | `{text}` |
| `profile_updated` | User profile | `{profile_text}` |
| `rerank_complete` | Tasks re-ranked | `{updated_tasks: [{id, new_position}]}` |
| `dedup_complete` | Duplicates removed | `{deleted_ids, reason}` |

Storage Migration

| Desktop SQLite | Cloud Storage | Status |
| --- | --- | --- |
| action_items | users/{uid}/action_items (Firestore) | EXISTS |
| memories (incl. advice) | users/{uid}/memories (Firestore) | EXISTS |
| conversations | users/{uid}/conversations (Firestore) | EXISTS |
| goals | users/{uid}/goals (Firestore) | EXISTS |
| focus_sessions | users/{uid}/focus_sessions (Firestore) | NEW |
| action_items.embedding | Pinecone vectors | REUSE existing infra |
| screenshots.embedding | Pinecone ns3 | REUSE (already syncs) |

Backend Reuse

| Desktop Feature | Backend Equivalent (PRODUCTION) |
| --- | --- |
| Memory extraction | `new_memories_extractor()` in utils/llm/memories.py |
| Action item extraction + dedup | `extract_action_items()` in utils/llm/conversation_processing.py |
| Goal progress detection | `extract_and_update_goal_progress()` in utils/llm/goals.py |
| User profile | Persona generation in utils/llm/persona.py |
| Data protection | AES-256-GCM encryption in utils/encryption.py |
| Vector search | Pinecone via database/vector_db.py |

New backend work: Vision LLM handlers for screenshot analysis (focus, task extraction, memory extraction, advice).

Subtasks

Backend (Python)

  • Add message dispatcher for new types in _stream_handler() (transcribe.py)
  • Implement handle_screen_frame() — routes to analysis handlers in parallel
  • Implement focus analysis (vision LLM → focus_result)
  • Implement task extraction (vision LLM + Firestore dedup + Pinecone similarity → tasks_extracted)
  • Implement memory extraction from screenshots (vision LLM → memories_extracted)
  • Implement advice extraction (vision LLM → advice_extracted)
  • Implement live notes handler (text LLM → live_note)
  • Implement task re-ranking handler (Firestore fetch + LLM → rerank_complete)
  • Implement task dedup handler (Firestore + Pinecone + LLM → dedup_complete)
  • Implement profile generation handler (multi-source fetch + LLM → profile_updated)
  • Add focus_sessions Firestore collection with data protection decorators
  • Add frame_id + idempotency for duplicate frame handling
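The first two subtasks (message dispatcher plus parallel `handle_screen_frame()` routing) could be shaped roughly like this. This is an illustrative sketch only: the analyzer stubs, the `DISPATCH`/`ANALYZERS` tables, and the `send` callback are all assumptions standing in for the real `_stream_handler()` integration and vision-LLM calls.

```python
import asyncio
import json

# Hypothetical per-capability analyzers; in the real implementation these
# would call a vision LLM plus Firestore/Pinecone as described above.
async def analyze_focus(msg: dict) -> dict:
    return {"type": "focus_result", "frame_id": msg["frame_id"], "status": "focused"}

async def analyze_tasks(msg: dict) -> dict:
    return {"type": "tasks_extracted", "frame_id": msg["frame_id"], "tasks": []}

ANALYZERS = {"focus": analyze_focus, "tasks": analyze_tasks}

async def handle_screen_frame(msg: dict, send) -> None:
    # Fan out the requested analyses in parallel, then emit one
    # server -> client message per completed analysis.
    jobs = [ANALYZERS[kind](msg) for kind in msg.get("analyze", []) if kind in ANALYZERS]
    for result in await asyncio.gather(*jobs):
        await send(json.dumps(result))

DISPATCH = {"screen_frame": handle_screen_frame}

async def dispatch_message(raw: str, send) -> None:
    # Dispatcher as it might slot into _stream_handler(): route on the
    # assumed "type" discriminator and ignore unknown message types.
    msg = json.loads(raw)
    handler = DISPATCH.get(msg.get("type"))
    if handler:
        await handler(msg, send)
```

Keeping one handler per capability in a lookup table is also a hedge against the monolith risk flagged in the review below: new message types become table entries rather than new branches inside `_stream_handler()`.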

Desktop (Swift)

  • Add sendJSON() method to BackendTranscriptionService for text messages
  • Add response handlers for all new server→client message types
  • Rewrite 9 assistants as thin WebSocket message senders
  • Remove GeminiClient.swift (1,450 lines)
  • Remove EmbeddingService.swift (315 lines)
  • Remove GEMINI_API_KEY from .env.example and loadEnvironment()
  • Replace local SQLite reads with Firestore-cached data where applicable

Testing

  • End-to-end test per analysis type (focus, tasks, memories, advice, notes)
  • Latency benchmarks (focus detection target: <3s including network hop)
  • Load test screenshot bandwidth (adaptive quality/cadence)
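The adaptive quality/cadence idea in the last bullet can be sketched as a client-side gate that skips uploads when the captured frame is unchanged. This is a hypothetical helper, not part of the current codebase; a real implementation might downscale first or use a perceptual hash rather than an exact digest.

```python
import hashlib

class FrameGate:
    """Skip screenshot uploads when the frame content has not changed.

    A minimal sketch of the adaptive-cadence idea, assuming raw frame
    bytes are available before encoding/upload.
    """

    def __init__(self):
        self._last_digest = None

    def should_upload(self, frame_bytes: bytes) -> bool:
        digest = hashlib.sha256(frame_bytes).digest()
        if digest == self._last_digest:
            return False  # identical frame: skip and save the bandwidth
        self._last_digest = digest
        return True
```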

Codex Review Summary

Scores: Correctness 6/10, Simplicity 3/10, Completeness 5/10

Key gaps to address during implementation:

  1. Protocol versioning and typed schemas per message type
  2. Backpressure — audio and vision on same WS need priority lanes
  3. Bandwidth strategy — adaptive screenshot quality/cadence, skip unchanged context
  4. Failure modes — partial outages, retries, idempotency
  5. Monolith risk — refactor transcribe.py into message dispatcher + per-capability handlers
  6. Local cache for offline mode (SQLite stays as cache, Firestore is source of truth)
  7. Privacy controls — PII/sensitive window filtering, user consent for screenshot upload

References

by AI for @beastoin

Metadata


Labels

desktop, p2 (Priority: Important, score 14-21)
