GSoC: Hands-Free Multimodal Voice Mode project #20770
sakshisemalti started this conversation in Ideas
Replies: 0 comments
Hi @bdmorgan
I'm Sakshi, and I'm interested in the Hands-Free Multimodal Voice Mode project.
I've worked with real-time audio before: I built Multilingual Mandi, a voice negotiation platform supporting 10 Indian languages with sub-second WebSocket-based translation and a voice-first UI. That's what drew me to this project: low-latency audio streaming and conversation state management are things I've dealt with hands-on.
I've set up the repo locally, got sandboxing running on macOS with Seatbelt, and have been going through the codebase. I have a few architectural questions before I start drafting my proposal:
For session isolation: should voice mode maintain its own isolated Live API WebSocket session or is the preference to wire it through the existing GeminiClient abstraction?
For audio I/O: are you open to a system-level dependency like SoX or is the preference a pure-Node solution with prebuilt native bindings?
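To make the audio I/O question concrete, here's the kind of system-dependency approach I have in mind, sketched in TypeScript. The SoX flags are standard, but the 16 kHz mono 16-bit PCM format is my assumption about what the Live API expects, and the `startCapture` helper is just illustrative; a pure-Node alternative would swap the spawned process for a prebuilt native binding:

```typescript
import { spawn, type ChildProcess } from "node:child_process";

// Build the SoX argument list for capturing raw PCM from the default input
// device and streaming it to stdout.
// (16 kHz / 16-bit / mono is an assumption about the Live API's input format.)
export function buildSoxCaptureArgs(sampleRate = 16000): string[] {
  return [
    "-d",                    // default audio input device
    "-t", "raw",             // raw (headerless) output
    "-r", String(sampleRate),
    "-e", "signed-integer",
    "-b", "16",              // 16-bit samples
    "-c", "1",               // mono
    "-",                     // write to stdout
  ];
}

// Spawn SoX and hand PCM chunks to a callback. This is exactly the piece a
// pure-Node solution (e.g. a prebuilt PortAudio binding) would replace.
export function startCapture(onChunk: (pcm: Buffer) => void): ChildProcess {
  const sox = spawn("sox", buildSoxCaptureArgs());
  sox.stdout.on("data", onChunk);
  return sox;
}
```

The trade-off as I see it: SoX keeps the Node side trivial but adds an install step and a per-platform failure mode; native bindings avoid the external dependency but pull prebuilt binaries into the install.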
For interruption handling: should the client send a signal to cancel the current response mid-stream or does the Live API handle this natively through VAD?
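To illustrate what I mean by client-side interruption handling, here's a minimal sketch of the pattern I'd expect either way: queued server audio gets dropped the moment an interruption is signaled. The `interrupted` field mirrors my understanding of what the Live API reports when server-side VAD detects user speech, but the exact message shape here is my assumption:

```typescript
// Minimal playback queue: server audio chunks are queued for output, and an
// interruption signal drops everything not yet played.
class PlaybackQueue {
  private chunks: Buffer[] = [];

  enqueue(chunk: Buffer): void {
    this.chunks.push(chunk);
  }

  // Called when the server reports that user speech cut off the model's turn.
  // Returns how many pending chunks were discarded.
  interrupt(): number {
    const dropped = this.chunks.length;
    this.chunks = [];
    return dropped;
  }

  get pending(): number {
    return this.chunks.length;
  }
}

// Hypothetical shape of an incoming Live API server message (my assumption).
interface ServerMessage {
  audioChunk?: Buffer;
  interrupted?: boolean;
}

function handleServerMessage(msg: ServerMessage, queue: PlaybackQueue): void {
  if (msg.interrupted) {
    queue.interrupt(); // stop speaking immediately; discard stale audio
    return;
  }
  if (msg.audioChunk) {
    queue.enqueue(msg.audioChunk);
  }
}
```

If the Live API's VAD handles barge-in natively, the client's only job is reacting to that signal as above; if not, the client would also need to send an explicit cancel upstream when local VAD fires.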
I've been doing some local analysis of the codebase and will share findings once I have clarity on these. Thanks!