GSoC: Hands-Free Multimodal Voice Mode project #20770
sakshisemalti started this conversation in Ideas
Replies: 0 comments
Hi @bdmorgan
I'm Sakshi, and I'm interested in the Hands-Free Multimodal Voice Mode project.
I've worked with real-time audio before: I built Multilingual Mandi, a voice negotiation platform supporting 10 Indian languages with sub-second WebSocket-based translation and a voice-first UI. That's what drew me to this project: low-latency audio streaming and conversation state management are things I've dealt with hands-on.
I've set up the repo locally, got sandboxing running on macOS with Seatbelt, and have been going through the codebase. I have a few architectural questions before I start drafting my proposal:
For session isolation: should voice mode maintain its own isolated Live API WebSocket session or is the preference to wire it through the existing GeminiClient abstraction?
For audio I/O: are you open to a system-level dependency like SoX or is the preference a pure-Node solution with prebuilt native bindings?
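To make the audio I/O question concrete, here's the kind of system-dependency approach I have in mind, sketched in TypeScript. The SoX flags are standard, but the 16 kHz mono 16-bit PCM format is my assumption about what the Live API expects, and the `startCapture` helper is just illustrative; a pure-Node alternative would swap the spawned process for a prebuilt native binding:

```typescript
import { spawn, type ChildProcess } from "node:child_process";

// Build the SoX argument list for capturing raw PCM from the default input
// device and streaming it to stdout.
// (16 kHz / 16-bit / mono is an assumption about the Live API's input format.)
export function buildSoxCaptureArgs(sampleRate = 16000): string[] {
  return [
    "-d",                    // default audio input device
    "-t", "raw",             // raw (headerless) output
    "-r", String(sampleRate),
    "-e", "signed-integer",
    "-b", "16",              // 16-bit samples
    "-c", "1",               // mono
    "-",                     // write to stdout
  ];
}

// Spawn SoX and hand PCM chunks to a callback. This is exactly the piece a
// pure-Node solution (e.g. a prebuilt PortAudio binding) would replace.
export function startCapture(onChunk: (pcm: Buffer) => void): ChildProcess {
  const sox = spawn("sox", buildSoxCaptureArgs());
  sox.stdout.on("data", onChunk);
  return sox;
}
```

The trade-off as I see it: SoX keeps the Node side trivial but adds an install step and a per-platform failure mode; native bindings avoid the external dependency but pull prebuilt binaries into the install.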
For interruption handling: should the client send a signal to cancel the current response mid-stream or does the Live API handle this natively through VAD?
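To illustrate what I mean by client-side interruption handling, here's a minimal sketch of the pattern I'd expect either way: queued server audio gets dropped the moment an interruption is signaled. The `interrupted` field mirrors my understanding of what the Live API reports when server-side VAD detects user speech, but the exact message shape here is my assumption:

```typescript
// Minimal playback queue: server audio chunks are queued for output, and an
// interruption signal drops everything not yet played.
class PlaybackQueue {
  private chunks: Buffer[] = [];

  enqueue(chunk: Buffer): void {
    this.chunks.push(chunk);
  }

  // Called when the server reports that user speech cut off the model's turn.
  // Returns how many pending chunks were discarded.
  interrupt(): number {
    const dropped = this.chunks.length;
    this.chunks = [];
    return dropped;
  }

  get pending(): number {
    return this.chunks.length;
  }
}

// Hypothetical shape of an incoming Live API server message (my assumption).
interface ServerMessage {
  audioChunk?: Buffer;
  interrupted?: boolean;
}

function handleServerMessage(msg: ServerMessage, queue: PlaybackQueue): void {
  if (msg.interrupted) {
    queue.interrupt(); // stop speaking immediately; discard stale audio
    return;
  }
  if (msg.audioChunk) {
    queue.enqueue(msg.audioChunk);
  }
}
```

If the Live API's VAD handles barge-in natively, the client's only job is reacting to that signal as above; if not, the client would also need to send an explicit cancel upstream when local VAD fires.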
I've been doing some local analysis of the codebase and will share findings once I have clarity on these. Thanks!