[GSoC Interest] Hands-Free Multimodal Voice Mode: Architecture Questions #19897
vijayabhaskar78 started this conversation in Ideas
Replies: 2 comments 4 replies
From my point of view, isolating the live voice session is the cleaner default: its lifecycle and failure modes differ enough from the normal request/response path that forcing them together too early creates more coupling than benefit. I would also keep the first milestone narrow: push-to-talk, continuity between typed and spoken turns, and safe tool access, before optimizing packaging decisions. If that orchestration layer is solid, the specific audio stack becomes easier to swap later.
Hi @bdmorgan
I'm Vijaya Bhaskar, exploring contributing to Hands-Free Multimodal Voice Mode for GSoC. I've set up the repo locally, studied the codebase end-to-end (core, cli, devtools, settings, hooks), and have been working through open issues. Before I write my proposal, I'd love the team's input on a few architectural questions, so I can align with how you'd like this built.
Q1: Voice session isolation
The Gemini Live API uses a persistent WebSocket, fundamentally different from the existing request/response ContentGenerator. Should voice mode maintain its own isolated session (similar to how devtoolsService is self-contained), or is there a preference to eventually wire it through the existing GeminiClient/GeminiChat abstraction?
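One way to frame the isolation question is to put the wire protocol behind a narrow transport interface that the session owns, so the decision about wiring through GeminiClient later only moves the boundary. This is a minimal sketch under assumed names (LiveTransport, VoiceSession are hypothetical, not gemini-cli APIs):

```typescript
import { EventEmitter } from "node:events";

// Hypothetical transport boundary: the session owns lifecycle (start/stop,
// open-state checks), while the actual Live API WebSocket would live behind
// this interface. Names are illustrative only.
interface LiveTransport {
  connect(): Promise<void>;
  close(): Promise<void>;
  send(chunk: Uint8Array): void;
}

class VoiceSession extends EventEmitter {
  private open = false;

  constructor(private transport: LiveTransport) {
    super();
  }

  async start(): Promise<void> {
    await this.transport.connect();
    this.open = true;
    this.emit("open");
  }

  // Audio can only flow while the session is open; the request/response
  // ContentGenerator path has no equivalent of this long-lived state.
  sendAudio(chunk: Uint8Array): void {
    if (!this.open) throw new Error("session not open");
    this.transport.send(chunk);
  }

  async stop(): Promise<void> {
    this.open = false;
    await this.transport.close();
    this.emit("close");
  }
}
```

With a fake LiveTransport injected, the session logic is testable without a network connection, which is one practical argument for keeping it isolated at first.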
Q2: Audio I/O approach
The official Gemini docs suggest mic + speaker capture, which requires SoX or ALSA as a system dependency. Are you open to shipping with a system-level dependency like SoX, or would you prefer a pure-Node solution (e.g. prebuilt native bindings with no external install)?
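For concreteness, the SoX route usually means spawning `sox` as a subprocess and streaming raw PCM from its stdout. The flag set below (16 kHz mono 16-bit signed PCM) is an illustration of that approach, not a tested integration, and assumes `sox` is on the PATH:

```typescript
// Sketch of building SoX arguments for mic capture to stdout.
// These flags are an assumption based on standard SoX usage, not a
// verified gemini-cli configuration.
function soxRecordArgs(sampleRate = 16000): string[] {
  return [
    "-d",                     // default input device (microphone)
    "-t", "raw",              // raw PCM output...
    "-r", String(sampleRate), // ...at the requested sample rate
    "-e", "signed-integer",   // signed samples
    "-b", "16",               // 16-bit depth
    "-c", "1",                // mono
    "-",                      // write to stdout for streaming
  ];
}

// Usage (requires SoX installed; spawn is from "node:child_process"):
// const mic = spawn("sox", soxRecordArgs());
// mic.stdout.on("data", (chunk) => { /* stream chunk to the Live API */ });
```

The pure-Node alternative trades this external install away for prebuilt native bindings, which shifts the cost from the end user's system to the package's platform matrix.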
Q3: Package structure
Would you prefer the voice mode code live inside the existing packages/core and packages/cli (following the devtools/MCP pattern), or would a new packages/voice workspace be more appropriate given the scope?
Q4: Tool/agent access in voice mode
The Live API supports native function calling. Should voice mode eventually have full access to the existing tool registry (shell, file ops, etc.), or start scoped to voice I/O only?
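If the answer is "start scoped", one low-cost shape is an allowlist filter over the existing registry, so shell and file tools can be enabled later without touching session code. A minimal sketch, with hypothetical names (ScopedToolRegistry is not an existing gemini-cli type):

```typescript
// Hypothetical scoped view over a tool registry for voice mode.
// All names are illustrative only.
type Tool = { name: string; run: (args: unknown) => Promise<string> };

class ScopedToolRegistry {
  constructor(
    private all: Map<string, Tool>,
    private allowed: Set<string>,
  ) {}

  // Names the model may call in this mode.
  list(): string[] {
    return [...this.all.keys()].filter((n) => this.allowed.has(n));
  }

  can(name: string): boolean {
    return this.allowed.has(name) && this.all.has(name);
  }

  async invoke(name: string, args: unknown): Promise<string> {
    if (!this.can(name)) {
      throw new Error(`tool ${name} is not permitted in voice mode`);
    }
    return this.all.get(name)!.run(args);
  }
}
```

Widening access later then becomes a one-line change to the allowlist rather than a refactor of the voice session.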
I have a detailed plan ready and am happy to open a tracking issue if that's preferred. Thanks in advance!
Thanks,
Vijaya Bhaskar