GSoC Interest Idea #11: Web Audio API! #19998
PranAD-dev
started this conversation in
Ideas
Replies: 1 comment
-
From my point of view, if the project goal is truly hands-free multimodal voice mode, native Live API-style streaming should be the center of the architecture, and Whisper-style transcription should stay in the fallback category rather than on the main path. The real product question is less about audio capture alone and more about interruption, latency, and how voice shares context with the existing agent loop. Your background sounds relevant, but a narrow proof of concept around push-to-talk and bidirectional session flow would probably make the strongest case.
-
Hey! I'm interested in contributing to the Hands-Free Multimodal Voice Mode idea for GSoC 2026.
I have past experience working with real-time audio and the Web Audio API. I built EchoEarth (it's on my GitHub; the sole reason I'm not linking it here is that it would seem like a shameless plug), a spatial-audio app where users have live voice conversations with AI-powered ecosystems. It uses the Web Audio API (HRTF PannerNode, ConvolverNode, dynamic crossfading) and Gemini for generation, and it won Best Use of ElevenLabs at SFHacks. I loved implementing the Web Audio HRTF work, and I have some ideas I would genuinely love to implement! (I just did the hackathon a week ago, and the overlap was surprising lol.)
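For the "dynamic crossfading" piece: the actual Web Audio graph (GainNodes, PannerNodes) needs a browser, but the curve that usually drives it is just math. A sketch of the standard equal-power crossfade, assuming two sources A and B blended by a parameter `t`:

```typescript
// Equal-power crossfade gains for blending source A into source B.
// t in [0, 1]: 0 = all A, 1 = all B. Because a^2 + b^2 = 1 at every
// point, perceived loudness stays roughly constant through the fade
// (a plain linear crossfade dips in the middle).
function equalPowerGains(t: number): { a: number; b: number } {
  const clamped = Math.min(1, Math.max(0, t));
  return {
    a: Math.cos((clamped * Math.PI) / 2),
    b: Math.sin((clamped * Math.PI) / 2),
  };
}
```

In a browser you would feed these values to two `GainNode.gain` params (e.g. via `setValueAtTime`) as `t` ramps; the function above is only the gain law.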
The overlap with this project is pretty direct. My past project addresses the same core problems: real-time audio streaming, voice activity detection, managing conversation state, keeping latency low enough for fluid interaction, and voice-to-text so Gemini can understand the user.
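On the voice-activity-detection point, the simplest baseline (an illustration of the idea, not necessarily what EchoEarth or the CLI would ship) is an RMS energy threshold per audio frame; real systems layer smoothing and hangover on top, or use a model like Silero VAD:

```typescript
// Naive energy-based VAD: mark a frame of PCM samples as speech when
// its RMS energy exceeds a threshold. The threshold here is arbitrary
// and would need calibration against real microphone noise floors.
function isSpeechFrame(samples: Float32Array, threshold = 0.02): boolean {
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / samples.length);
  return rms > threshold;
}
```

In a browser this would run over frames pulled from an `AudioWorklet` or `MediaStream`; the logic itself is environment-independent.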
I've started exploring the Gemini CLI codebase and I'm working on my first contribution. I also had a question about the voice mode: is the vision to use Gemini's native Live API for bidirectional audio streaming with Whisper alongside it, or is it just Whisper, as stated in the docs?
@bdmorgan