Replies: 3 comments 1 reply
-

**Update: working prototype.** Since posting this, I've built a complete proof of concept: https://github.com/krishdef7/Voice-Orchestrator-Prototype

The prototype focuses on the orchestration layer, the part that's hardest to get right: three concurrent streams (voice input, model output, tool execution) coordinated through a 4-state FSM, with guaranteed barge-in from any state and AbortController-gated tool cancellation mid-flight.

The integration layer (`gemini-cli-integration.ts`) maps directly onto gemini-cli's actual interfaces (`ToolRegistry.getToolSchemas()`, `BaseDeclarativeTool.createInvocation()`, and `CoreToolScheduler`'s validate → confirm → execute pipeline), so the existing codebase needs minimal changes. (I have a merged PR in v0.32.0, #20419, so the codebase isn't new to me.)

219 tests across 5 modules, 0 failures, 0 TypeScript errors. `npm run demo` runs without an API key if you want to see the three orchestration scenarios.
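The orchestration pattern described above (FSM, barge-in from any state, AbortController-gated cancellation) can be sketched roughly like this; the state and method names here are illustrative, not the prototype's actual code:

```typescript
// Minimal sketch: a small voice FSM where barge-in is legal from any state
// and an AbortController gates the in-flight tool call. Illustrative only.
type VoiceState = 'idle' | 'listening' | 'responding' | 'executing-tool';

class VoiceOrchestrator {
  state: VoiceState = 'idle';
  private toolAbort: AbortController | null = null;

  startListening(): void {
    this.state = 'listening';
  }

  // Run a tool under an AbortSignal; a barge-in aborts it mid-flight.
  async runTool(
    tool: (signal: AbortSignal) => Promise<string>,
  ): Promise<string | null> {
    this.state = 'executing-tool';
    this.toolAbort = new AbortController();
    try {
      return await tool(this.toolAbort.signal);
    } catch (err) {
      if (this.toolAbort?.signal.aborted) return null; // cancelled by barge-in
      throw err;
    } finally {
      this.toolAbort = null;
    }
  }

  // Barge-in from any state: abort in-flight work, return to listening.
  bargeIn(): void {
    this.toolAbort?.abort();
    this.state = 'listening';
  }
}
```

The key design point is that cancellation is cooperative: the tool receives the `AbortSignal` and is responsible for cleaning up when it fires, so a barge-in never leaves a half-finished side effect unattended.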
-

From my point of view, the most convincing part here is the focus on orchestration rather than only on audio transport. A proof of concept that already exercises interruption, tool cancellation, and concurrent streams is exactly the kind of evidence that makes a voice proposal feel realistic. I would still keep the first integration milestone tight around push-to-talk, reliable state transitions, and spoken summaries of tool output before expanding into more ambitious activation modes.
-

Hi @bdmorgan, the prototype is now complete and tested if you want to take a look: https://github.com/krishdef7/Voice-Orchestrator-Prototype
-
Hi @bdmorgan,
I'm Krish Garg, a 3rd-year student at IIT Roorkee with research experience in deep learning and AI systems. I'm preparing a GSoC 2026 proposal for Project #11 and wanted to share my architectural thinking early to get your feedback before finalizing.
Why Native Live API — Not Another Whisper Wrapper
The existing community solutions (#1982, #6929, PR #18499, and the voice-mode MCP) all follow the same pattern: record audio → send to Whisper/OpenAI STT → inject text into the CLI prompt. This fundamentally limits the experience.

The GSoC project description explicitly calls for "Gemini's native multimodal audio capabilities", meaning `client.aio.live.connect()` with `BidiGenerateContent` streaming. This is a fundamentally different architecture that none of the existing PRs or community tools implement.

Proposed Architecture
- **Core service** (`packages/core/src/voice/VoiceModeService.ts`): a dedicated Live API WebSocket session using `gemini-2.5-flash-native-audio-preview`, fully isolated from the existing `GeminiClient` HTTP session. Input: 16-bit PCM at 16 kHz. Output: 24 kHz audio playback.
- **Audio I/O**: `naudiodon` (Node-API/N-API bindings to PortAudio) rather than SoX/ALSA. N-API provides a stable ABI across Node.js versions, avoiding the brittle shell-exec pattern SoX requires and the version-compatibility issues it introduces in the CI pipeline.
- **VAD & interruption**: leverage the Live API's built-in server-side VAD rather than shipping a client-side Silero model or energy-threshold logic. For barge-in, the client sends a `BidiGenerateContentClientContent` message with `turn_complete: false` to signal interruption; this aligns with the Live API's native interrupt model rather than a separate `cancel_generation` frame.
- **Activation**: a `/voice` slash command as the entry point.
- **Ink UI** (`packages/cli/src/ui/components/VoiceMode.tsx`): an animated waveform with listening / speaking / processing states using React state plus Ink's `Box`. No new workspace package needed.
- **Tool integration**: register existing `ToolRegistry` tools on the Live session's function-calling interface in Milestone 2, enabling file ops and shell commands mid-voice-conversation.
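As a rough sketch of the two wire shapes this design leans on (the realtime PCM input chunk and the barge-in client-content message), built as plain objects. The field names follow my reading of the BidiGenerateContent docs (the proto's `turn_complete` is camel-cased to `turnComplete` in the JS SDK) and should be verified against the current SDK before relying on them:

```typescript
// Hypothetical message builders for the Live API session; field names are
// assumptions drawn from the BidiGenerateContent docs, not verified SDK types.

// Wrap a raw 16-bit / 16 kHz PCM chunk as a realtime audio input payload.
function pcmToRealtimeInput(pcm: Uint8Array) {
  return {
    audio: {
      data: Buffer.from(pcm).toString('base64'),
      mimeType: 'audio/pcm;rate=16000',
    },
  };
}

// Barge-in: an (empty-turn) client-content message with turnComplete: false,
// signalling that the user is speaking again and the model should yield.
function bargeInMessage() {
  return { turns: [], turnComplete: false };
}
```

Keeping these builders as pure functions makes the orchestration layer testable without a live WebSocket: the session wiring only has to serialize whatever they return.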
Questions for You
1. I noticed that issue #1982 (Audio input instead of typing) carries the `🔒 maintainer only` label. Is there internal groundwork already laid that the GSoC contributor should build on top of, rather than starting fresh?
2. Is `naudiodon` (PortAudio N-API bindings) acceptable as a native dependency, or is there a preference for a pure-JS audio approach?
3. Audio hardware access (mic + speakers) requires an exemption from the Seatbelt/Docker sandbox, since `/dev/snd` and mic devices are blocked by default. Should this be handled via a new sandbox capability flag, or should voice mode document that it requires `--no-sandbox`?

Happy to share a full written proposal doc. Looking forward to your feedback.
— Krish Garg (@krishdef7)