[GSoC Interest] Hands-Free Multimodal Voice Mode: Architecture Questions #19897
vijayabhaskar78 started this conversation in Ideas
Replies: 2 comments 4 replies
From my point of view, isolating the live voice session is the cleaner default: its lifecycle and failure modes differ enough from the normal request/response path that forcing them together too early creates more coupling than benefit. I would also keep the first milestone narrow: push-to-talk, continuity between typed and spoken turns, and safe tool access, before optimizing packaging decisions. If that orchestration layer is solid, the specific audio stack becomes easier to swap later.
Hi @bdmorgan
I'm Vijaya Bhaskar, exploring contributing to Hands-Free Multimodal Voice Mode for GSoC. I've set up the repo locally, studied the codebase end-to-end (core, cli, devtools, settings, hooks), and have been working through open issues. Before I write my proposal, I'd love the team's input on a few architectural questions, so I can align with how you'd like this built.
Q1: Voice session isolation
The Gemini Live API uses a persistent WebSocket, fundamentally different from the existing request/response ContentGenerator. Should voice mode maintain its own isolated session (similar to how devtoolsService is self-contained), or is there a preference to eventually wire it through the existing GeminiClient/GeminiChat abstraction?
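One way to frame the isolation question is to put the wire protocol behind a narrow transport interface that the session owns, so the decision about wiring through GeminiClient later only moves the boundary. This is a minimal sketch under assumed names (LiveTransport, VoiceSession are hypothetical, not gemini-cli APIs):

```typescript
import { EventEmitter } from "node:events";

// Hypothetical transport boundary: the session owns lifecycle (start/stop,
// open-state checks), while the actual Live API WebSocket would live behind
// this interface. Names are illustrative only.
interface LiveTransport {
  connect(): Promise<void>;
  close(): Promise<void>;
  send(chunk: Uint8Array): void;
}

class VoiceSession extends EventEmitter {
  private open = false;

  constructor(private transport: LiveTransport) {
    super();
  }

  async start(): Promise<void> {
    await this.transport.connect();
    this.open = true;
    this.emit("open");
  }

  // Audio can only flow while the session is open; the request/response
  // ContentGenerator path has no equivalent of this long-lived state.
  sendAudio(chunk: Uint8Array): void {
    if (!this.open) throw new Error("session not open");
    this.transport.send(chunk);
  }

  async stop(): Promise<void> {
    this.open = false;
    await this.transport.close();
    this.emit("close");
  }
}
```

With a fake LiveTransport injected, the session logic is testable without a network connection, which is one practical argument for keeping it isolated at first.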
Q2: Audio I/O approach
The official Gemini docs suggest mic + speaker capture, which requires SoX or ALSA as a system dependency. Are you open to shipping with a system-level dependency like SoX, or would you prefer a pure-Node solution (e.g. prebuilt native bindings with no external install)?
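For concreteness, the SoX route usually means spawning `sox` as a subprocess and streaming raw PCM from its stdout. The flag set below (16 kHz mono 16-bit signed PCM) is an illustration of that approach, not a tested integration, and assumes `sox` is on the PATH:

```typescript
// Sketch of building SoX arguments for mic capture to stdout.
// These flags are an assumption based on standard SoX usage, not a
// verified gemini-cli configuration.
function soxRecordArgs(sampleRate = 16000): string[] {
  return [
    "-d",                     // default input device (microphone)
    "-t", "raw",              // raw PCM output...
    "-r", String(sampleRate), // ...at the requested sample rate
    "-e", "signed-integer",   // signed samples
    "-b", "16",               // 16-bit depth
    "-c", "1",                // mono
    "-",                      // write to stdout for streaming
  ];
}

// Usage (requires SoX installed; spawn is from "node:child_process"):
// const mic = spawn("sox", soxRecordArgs());
// mic.stdout.on("data", (chunk) => { /* stream chunk to the Live API */ });
```

The pure-Node alternative trades this external install away for prebuilt native bindings, which shifts the cost from the end user's system to the package's platform matrix.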
Q3: Package structure
Would you prefer the voice mode code live inside the existing packages/core and packages/cli (following the devtools/MCP pattern), or would a new packages/voice workspace be more appropriate given the scope?
Q4: Tool/agent access in voice mode
The Live API supports native function calling. Should voice mode eventually have full access to the existing tool registry (shell, file ops, etc.), or start scoped to voice I/O only?
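If the answer is "start scoped", one low-cost shape is an allowlist filter over the existing registry, so shell and file tools can be enabled later without touching session code. A minimal sketch, with hypothetical names (ScopedToolRegistry is not an existing gemini-cli type):

```typescript
// Hypothetical scoped view over a tool registry for voice mode.
// All names are illustrative only.
type Tool = { name: string; run: (args: unknown) => Promise<string> };

class ScopedToolRegistry {
  constructor(
    private all: Map<string, Tool>,
    private allowed: Set<string>,
  ) {}

  // Names the model may call in this mode.
  list(): string[] {
    return [...this.all.keys()].filter((n) => this.allowed.has(n));
  }

  can(name: string): boolean {
    return this.allowed.has(name) && this.all.has(name);
  }

  async invoke(name: string, args: unknown): Promise<string> {
    if (!this.can(name)) {
      throw new Error(`tool ${name} is not permitted in voice mode`);
    }
    return this.all.get(name)!.run(args);
  }
}
```

Widening access later then becomes a one-line change to the allowlist rather than a refactor of the voice session.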
I have a detailed plan ready and am happy to open a tracking issue if that's preferred. Thanks in advance!
Thanks,
Vijaya Bhaskar