Replies: 3 comments 1 reply
-

**Update: working prototype.** Since posting this, I've built a complete proof of concept: https://github.com/krishdef7/Voice-Orchestrator-Prototype

The prototype focuses on the orchestration layer, the part that's hardest to get right: three concurrent streams (voice input, model output, tool execution) coordinated through a 4-state FSM, with guaranteed barge-in from any state and AbortController-gated tool cancellation mid-flight.

The integration layer (`gemini-cli-integration.ts`) maps directly onto gemini-cli's actual interfaces (`ToolRegistry.getToolSchemas()`, `BaseDeclarativeTool.createInvocation()`, and `CoreToolScheduler`'s validate → confirm → execute pipeline), so the existing codebase needs minimal changes. (I have a merged PR in v0.32.0, #20419, so the codebase isn't new to me.)

219 tests across 5 modules, 0 failures, 0 TypeScript errors. `npm run demo` runs without an API key if you want to see the three orchestration scenarios.
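The orchestration pattern described above (FSM, barge-in from any state, AbortController-gated cancellation) can be sketched roughly like this; the state and method names here are illustrative, not the prototype's actual code:

```typescript
// Minimal sketch: a small voice FSM where barge-in is legal from any state
// and an AbortController gates the in-flight tool call. Illustrative only.
type VoiceState = 'idle' | 'listening' | 'responding' | 'executing-tool';

class VoiceOrchestrator {
  state: VoiceState = 'idle';
  private toolAbort: AbortController | null = null;

  startListening(): void {
    this.state = 'listening';
  }

  // Run a tool under an AbortSignal; a barge-in aborts it mid-flight.
  async runTool(
    tool: (signal: AbortSignal) => Promise<string>,
  ): Promise<string | null> {
    this.state = 'executing-tool';
    this.toolAbort = new AbortController();
    try {
      return await tool(this.toolAbort.signal);
    } catch (err) {
      if (this.toolAbort?.signal.aborted) return null; // cancelled by barge-in
      throw err;
    } finally {
      this.toolAbort = null;
    }
  }

  // Barge-in from any state: abort in-flight work, return to listening.
  bargeIn(): void {
    this.toolAbort?.abort();
    this.state = 'listening';
  }
}
```

The key design point is that cancellation is cooperative: the tool receives the `AbortSignal` and is responsible for cleaning up when it fires, so a barge-in never leaves a half-finished side effect unattended.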
-

From my point of view, the most convincing part here is the focus on orchestration rather than only on audio transport. A proof of concept that already exercises interruption, tool cancellation, and concurrent streams is exactly the kind of evidence that makes a voice proposal feel realistic. I would still keep the first integration milestone tight around push-to-talk, reliable state transitions, and spoken summaries of tool output before expanding into more ambitious activation modes.
-

Hi @bdmorgan, the prototype is now complete and tested if you want to take a look: https://github.com/krishdef7/Voice-Orchestrator-Prototype
-
Hi @bdmorgan,
I'm Krish Garg, a 3rd-year student at IIT Roorkee with research experience in deep learning and AI systems. I'm preparing a GSoC 2026 proposal for Project #11 and wanted to share my architectural thinking early to get your feedback before finalizing.
Why Native Live API — Not Another Whisper Wrapper
The existing community solutions (#1982, #6929, PR #18499, and the voice-mode MCP) all follow the same pattern: record audio → send to Whisper/OpenAI STT → inject text into the CLI prompt. This fundamentally limits the experience.

The GSoC project description explicitly calls for "Gemini's native multimodal audio capabilities", meaning `client.aio.live.connect()` with `BidiGenerateContent` streaming. This is a fundamentally different architecture that none of the existing PRs or community tools implement.

Proposed Architecture
- **Core service** (`packages/core/src/voice/VoiceModeService.ts`): a dedicated Live API WebSocket session using `gemini-2.5-flash-native-audio-preview`, fully isolated from the existing `GeminiClient` HTTP session. Input: 16-bit PCM at 16 kHz. Output: 24 kHz audio playback.
- **Audio I/O**: `naudiodon` (Node-API/N-API bindings to PortAudio) rather than SoX/ALSA. N-API provides a stable ABI across Node.js versions, avoiding the brittle shell-exec pattern SoX requires and the version-compatibility issues it introduces in the CI pipeline.
- **VAD & interruption**: leverage the Live API's built-in server-side VAD rather than shipping a client-side Silero model or energy-threshold logic. For barge-in, the client sends a `BidiGenerateContentClientContent` message with `turn_complete: false` to signal interruption; this aligns with the Live API's native interrupt model rather than a separate `cancel_generation` frame.
- **Activation**: a `/voice` slash command as the entry point.
- **Ink UI** (`packages/cli/src/ui/components/VoiceMode.tsx`): an animated waveform with listening / speaking / processing states using React state plus Ink's `Box`. No new workspace package needed.
- **Tool integration**: register existing `ToolRegistry` tools on the Live session's function-calling interface in Milestone 2, enabling file ops and shell commands mid-voice-conversation.
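As a rough sketch of the two wire shapes this design leans on (the realtime PCM input chunk and the barge-in client-content message), built as plain objects. The field names follow my reading of the BidiGenerateContent docs (the proto's `turn_complete` is camel-cased to `turnComplete` in the JS SDK) and should be verified against the current SDK before relying on them:

```typescript
// Hypothetical message builders for the Live API session; field names are
// assumptions drawn from the BidiGenerateContent docs, not verified SDK types.

// Wrap a raw 16-bit / 16 kHz PCM chunk as a realtime audio input payload.
function pcmToRealtimeInput(pcm: Uint8Array) {
  return {
    audio: {
      data: Buffer.from(pcm).toString('base64'),
      mimeType: 'audio/pcm;rate=16000',
    },
  };
}

// Barge-in: an (empty-turn) client-content message with turnComplete: false,
// signalling that the user is speaking again and the model should yield.
function bargeInMessage() {
  return { turns: [], turnComplete: false };
}
```

Keeping these builders as pure functions makes the orchestration layer testable without a live WebSocket: the session wiring only has to serialize whatever they return.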
Questions for You
1. I noticed that issue #1982 (Audio input instead of typing) carries the `🔒 maintainer only` label. Is there internal groundwork already laid that the GSoC contributor should build on top of, rather than starting fresh?
2. Is `naudiodon` (PortAudio N-API bindings) acceptable as a native dependency, or is there a preference for a pure-JS audio approach?
3. Audio hardware access (mic + speakers) requires an exemption from the Seatbelt/Docker sandbox, since `/dev/snd` and mic devices are blocked by default. Should this be handled via a new sandbox capability flag, or should voice mode document that it requires `--no-sandbox`?

Happy to share a full written proposal doc. Looking forward to your feedback.
— Krish Garg (@krishdef7)