Real-time voice agent powered by Gemini Live API
Jemmie is a real-time voice agent backend built for the Gemini Live Agent Challenge. It delivers sub-second audio latency through the Gemini Live API, providing WebSocket-based infrastructure for natural voice interaction with session persistence and action handling.
The system handles bidirectional audio streaming, visual context processing, and stateful session management through a layered architecture designed for extensibility and testability. Each layer has a single responsibility: the gateway manages WebSocket lifecycle, the engine routes frames through a state machine, pipelines transform audio and image data, the agent layer interfaces with Gemini Live API, and the persistence layer maintains session state.
```mermaid
flowchart TB
    subgraph CLIENT["Client Layer"]
        direction LR
        APP["Mobile App<br/><sub>iOS / Android</sub>"]
    end

    subgraph GCP["Google Cloud Platform"]
        direction TB
        subgraph GATEWAY["Gateway Layer"]
            WS["ConnectionGateway<br/><sub>WebSocket Lifecycle</sub>"]
        end
        subgraph ENGINE["Engine Layer"]
            FE["FrameEngine<br/><sub>State Machine</sub>"]
            CC["ConnectionContext<br/><sub>Shared State</sub>"]
        end
        subgraph PIPELINE["Pipeline Layer"]
            direction LR
            AUDIO["Audio<br/><sub>16kHz to 24kHz</sub>"]
            IMAGE["Image<br/><sub>JPEG Processing</sub>"]
            ACTION["Action<br/><sub>Event Dispatch</sub>"]
        end
        subgraph AGENT["Agent Layer"]
            LS["LiveSession<br/><sub>Streaming Manager</sub>"]
            ROUTER["Tool Router<br/><sub>9 Built-in Tools</sub>"]
            SEARCH["Web Search<br/><sub>Grounding</sub>"]
        end
        subgraph PERSIST["Persistence"]
            SM["Session Manager"]
            FIRESTORE[("Firestore<br/><sub>Session State</sub>")]
        end
    end

    subgraph GEMINI["Google AI"]
        GEM["Gemini Live API<br/><sub>gemini-live-2.5-flash</sub>"]
    end

    subgraph EXTERNAL["External Services"]
        CSE["Custom Search API<br/><sub>Web Grounding</sub>"]
    end

    APP <-->|"wss:// Secure WebSocket"| WS
    WS --> FE
    FE --> CC
    CC --> AUDIO & IMAGE & ACTION
    AUDIO & IMAGE --> LS
    ACTION --> ROUTER
    ROUTER --> SEARCH
    LS <-->|"Real-time Audio<br/>Function Calls"| GEM
    SEARCH -->|"Search Results"| GEM
    SEARCH -.->|"JSON API"| CSE
    FE --> SM
    SM <--> FIRESTORE
```
The system follows a layered architecture designed for real-time voice interaction with sub-second latency:
| Layer | Responsibility | Components |
|---|---|---|
| Gateway | WebSocket lifecycle, connection management | ConnectionGateway |
| Engine | Frame routing via state machine (IDLE→CONNECTED→ACTIVE→DRAINING→CLOSED) | FrameEngine, ConnectionContext |
| Pipeline | Data transformation between client and Gemini formats | Audio, Image, Action pipelines |
| Agent | Gemini Live API integration, tool execution | LiveSession, Tool Router, Web Search |
| Persistence | Session state with 10-minute resumption window | Session Manager, Firestore |
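The five-state lifecycle in the Engine row above can be sketched in a few lines of Python. This is an illustrative model only, assuming the transition order implied by the table (`IDLE→CONNECTED→ACTIVE→DRAINING→CLOSED`) plus frame queuing during non-active states; the actual `FrameEngine` in `src/engine/` may differ:

```python
from enum import Enum, auto


class ConnState(Enum):
    IDLE = auto()
    CONNECTED = auto()
    ACTIVE = auto()
    DRAINING = auto()
    CLOSED = auto()


# Allowed forward transitions; any state may also jump to CLOSED on error.
TRANSITIONS = {
    ConnState.IDLE: {ConnState.CONNECTED, ConnState.CLOSED},
    ConnState.CONNECTED: {ConnState.ACTIVE, ConnState.CLOSED},
    ConnState.ACTIVE: {ConnState.DRAINING, ConnState.CLOSED},
    ConnState.DRAINING: {ConnState.CLOSED},
    ConnState.CLOSED: set(),
}


class FrameEngineSketch:
    """Queues frames until ACTIVE, then processes them in arrival order."""

    def __init__(self):
        self.state = ConnState.IDLE
        self.queue = []
        self.processed = []

    def transition(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        if self.state is ConnState.ACTIVE:
            # Drain frames that arrived before the session became active.
            while self.queue:
                self.processed.append(self.queue.pop(0))

    def handle_frame(self, frame):
        if self.state is ConnState.ACTIVE:
            self.processed.append(frame)
        elif self.state in (ConnState.IDLE, ConnState.CONNECTED):
            self.queue.append(frame)
        # Frames arriving while DRAINING or CLOSED are dropped.


engine = FrameEngineSketch()
engine.handle_frame("audio-1")          # queued: not yet active
engine.transition(ConnState.CONNECTED)
engine.transition(ConnState.ACTIVE)     # queued frame is flushed
engine.handle_frame("audio-2")
print(engine.processed)                 # ['audio-1', 'audio-2']
```

The key property this models is that no frame is lost during connection setup: anything received before `ACTIVE` is replayed in order once the Gemini session is live.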
| Tool | Type | Description |
|---|---|---|
| `SET_TIMER` | Client-bound | Triggers countdown timer on device |
| `END_CALL` | Client-bound | Gracefully terminates the session |
| `OPEN_URL` | Client-bound | Opens URLs/maps on device |
| `FETCH_LOCATION` | Client-bound | Requests user's GPS location |
| `SET_REMINDER` | Client-bound | Schedules push notification |
| `REQUEST_BINARY_INPUT` | Client-bound | Yes/No via volume buttons |
| `REQUEST_CAMERA_PREVIEW` | Client-bound | Captures photo from camera |
| `COPY_TO_CLIPBOARD` | Client-bound | Copies text to device clipboard |
| `WEB_SEARCH` | Model-bound | Grounding with real-time web data |
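As a rough illustration of the client-bound/model-bound split in the table above, routing can be thought of as a partition over tool names. The tool names below come from the table; the `route_tool_call` helper is hypothetical, not the actual `src/agent/` API:

```python
# Tool names from the table above; the routing logic itself is a sketch.
CLIENT_BOUND = {
    "SET_TIMER", "END_CALL", "OPEN_URL", "FETCH_LOCATION", "SET_REMINDER",
    "REQUEST_BINARY_INPUT", "REQUEST_CAMERA_PREVIEW", "COPY_TO_CLIPBOARD",
}
MODEL_BOUND = {"WEB_SEARCH"}


def route_tool_call(name: str, args: dict) -> dict:
    """Decide where a Gemini function call should be executed."""
    if name in MODEL_BOUND:
        # Executed server-side; the result goes back to the model as a
        # function response (e.g. search snippets for grounding).
        return {"target": "server", "tool": name, "args": args}
    if name in CLIENT_BOUND:
        # Serialized into an action frame and pushed to the device.
        return {"target": "client", "tool": name, "args": args}
    raise KeyError(f"unknown tool: {name}")


print(route_tool_call("SET_TIMER", {"seconds": 60})["target"])     # client
print(route_tool_call("WEB_SEARCH", {"query": "news"})["target"])  # server
```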
Layer Details:

- Gateway Layer (`src/gateway/`): Handles WebSocket connection acceptance, lifecycle management, and guaranteed cleanup on disconnect. It initializes the connection context and spawns the frame engine.
- Engine Layer (`src/engine/`): Routes incoming frames through a state machine with five connection states: `IDLE`, `CONNECTED`, `ACTIVE`, `DRAINING`, and `CLOSED`. Frames received during non-active states are queued and processed once the connection becomes active.
- Pipeline Layer (`src/pipelines/`): Transforms data between client and Gemini formats. Audio is resampled from 16kHz client input to 24kHz Gemini input, with output transcoded back. Image frames are validated and forwarded for multimodal understanding. Actions are dispatched to handlers or routed to the client.
- Agent Layer (`src/agent/`): Manages the bidirectional streaming connection to the Gemini Live API through the Google GenAI SDK. Tool handlers execute server-side logic (e.g., `WEB_SEARCH` for grounding) or send commands to the client (e.g., `SET_TIMER`).
- Persistence Layer (`src/session/`): Stores session state in Firestore with a device-as-identity pattern. Sessions can be resumed within a 10-minute window, allowing users to continue conversations across connection drops.
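The 16kHz-to-24kHz resampling step in the Pipeline layer can be illustrated with a naive linear interpolator over mono PCM samples. This is a sketch only, assuming plain integer sample lists; a production pipeline would use a proper polyphase or windowed-sinc resampler:

```python
def resample_linear(samples: list[int], src_hz: int = 16_000,
                    dst_hz: int = 24_000) -> list[int]:
    """Naive linear-interpolation resampler for mono 16-bit PCM samples.

    Illustrates only the rate conversion: upsampling by 3/2 turns every
    4 input samples into 6 output samples.
    """
    if not samples:
        return []
    ratio = src_hz / dst_hz  # each output step advances 2/3 of an input sample
    out_len = int(len(samples) * dst_hz / src_hz)
    out = []
    for i in range(out_len):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Blend the two neighbouring input samples by the fractional position.
        out.append(round(samples[lo] * (1 - frac) + samples[hi] * frac))
    return out


chunk = [0, 100, 200, 300]              # 4 samples at 16kHz
print(len(resample_linear(chunk)))      # 6 samples at 24kHz
```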
Key Features:

- Bidirectional Audio Streaming: PCM audio with automatic format conversion between client (16kHz) and Gemini (24kHz) sample rates
- Stateful Session Management: Device-as-identity pattern with 10-minute resumption window for conversation continuity
- Action System: Server-to-client commands (SET_TIMER) and client-to-server events (SHARE_LOCATION) for interactive features
- Visual Context Support: JPEG frame processing for multimodal understanding with the live session
- Web Search Grounding: Real-time web data integration via Google Custom Search API for up-to-date information
- Graceful Degradation: Connection state machine with frame queuing and guaranteed cleanup on disconnect
This project uses uv for package management. Install it from astral.sh/uv if you don't
have it already.
Clone the repository and set up your environment:
```shell
git clone https://github.com/oadultradeepfield/jemmie-backend.git
cd jemmie-backend
cp .env.example .env
```

Edit `.env` and add your Google API key from Google AI Studio. Then start the development server:

```shell
make dev
```

The WebSocket endpoint will be available at `ws://localhost:8080/ws/{device_id}`.
```shell
make check  # Run linting and type checks
make test   # Run test suite
```

The `.env.example` file contains the minimal configuration for local development:
| Variable | Description | Default |
|---|---|---|
| `GOOGLE_API_KEY` | API key from Google AI Studio | Required for local dev |
| `GOOGLE_GENAI_USE_VERTEXAI` | Use Vertex AI instead of API key | `TRUE` (use `FALSE` for API key) |
| `GOOGLE_CLOUD_PROJECT` | GCP project ID | Required for Vertex AI mode |
| `GOOGLE_CLOUD_LOCATION` | GCP region | `us-central1` (required for Gemini Live) |
| `GOOGLE_SEARCH_API_KEY` | Custom Search API key | Optional, for web grounding |
| `GOOGLE_SEARCH_ENGINE_ID` | Programmable Search Engine ID | Optional, for web grounding |
For local development with an API key, set GOOGLE_GENAI_USE_VERTEXAI=FALSE and provide your GOOGLE_API_KEY. For
production deployment on Cloud Run, the service account credentials are used automatically with Vertex AI.
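The mode switch described above might look like the following helper, which assembles keyword arguments in the shape the Google GenAI Python SDK's `Client` constructor accepts. Only the environment variable names come from `.env.example`; the helper itself is illustrative:

```python
import os


def genai_auth_mode() -> dict:
    """Pick credentials per the table above: Vertex AI (service account)
    versus a Google AI Studio API key. Illustrative helper only."""
    use_vertex = os.environ.get(
        "GOOGLE_GENAI_USE_VERTEXAI", "TRUE").upper() == "TRUE"
    if use_vertex:
        # On Cloud Run, project/location plus ambient service-account
        # credentials are enough; no API key is needed.
        return {
            "vertexai": True,
            "project": os.environ["GOOGLE_CLOUD_PROJECT"],
            "location": os.environ.get("GOOGLE_CLOUD_LOCATION", "us-central1"),
        }
    # Local development path: plain API key from Google AI Studio.
    return {"vertexai": False, "api_key": os.environ["GOOGLE_API_KEY"]}


os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "FALSE"
os.environ["GOOGLE_API_KEY"] = "dummy-key-for-demo"  # placeholder value
print(genai_auth_mode()["vertexai"])  # False
```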
This section provides a quick way to verify the backend functionality.
- Clone and set up:

  ```shell
  git clone https://github.com/oadultradeepfield/jemmie-backend.git
  cd jemmie-backend
  cp .env.example .env
  ```

- Add your Google API key to `.env`:

  ```shell
  GOOGLE_API_KEY=your-api-key-from-aistudio
  GOOGLE_GENAI_USE_VERTEXAI=FALSE
  ```

- Run the server:

  ```shell
  make dev
  ```

- Verify the server is running:

  ```shell
  curl http://localhost:8080/health
  # Expected: {"status":"healthy"}
  ```
The backend exposes a WebSocket endpoint at `ws://localhost:8080/ws/{device_id}`. You can test it with wscat:

```shell
# Install wscat if needed
npm install -g wscat

# Connect to the WebSocket
wscat -c ws://localhost:8080/ws/test-device

# Send a text message (triggers Gemini response)
{"type":"TEXT","payload":{"text":"Hello, what can you do?"}}

# Expected: Audio and text responses from the AI
```

To test the web search feature, you need Google Custom Search API credentials:
- Enable the Custom Search API in the Google Cloud Console
- Create an API key with Custom Search API access
- Create a Programmable Search Engine at programmablesearchengine.google.com:
  - Set it to "Search the entire web"
  - Copy the Search Engine ID
- Add the credentials to `.env`:

  ```shell
  GOOGLE_SEARCH_API_KEY=your-api-key
  GOOGLE_SEARCH_ENGINE_ID=your-engine-id
  ```

- Restart the server and ask a time-sensitive question:

  ```shell
  {"type":"TEXT","payload":{"text":"What's the latest news about AI today?"}}
  ```
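The `TEXT` frames shown in the examples above are plain JSON, so a scripted client can build and parse them with the standard library. The envelope shape (`type`/`payload`) is taken from the examples; the helper name is illustrative:

```python
import json


def make_text_frame(text: str) -> str:
    """Serialize a TEXT frame in the envelope shape shown above."""
    return json.dumps({"type": "TEXT", "payload": {"text": text}})


frame = make_text_frame("Hello, what can you do?")
parsed = json.loads(frame)
print(parsed["type"])             # TEXT
print(parsed["payload"]["text"])  # Hello, what can you do?
```

This is handy for driving the WebSocket endpoint from test scripts instead of typing frames into wscat by hand.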
```shell
make check  # Linting + type checking
make test   # Run all tests (118 tests)
```

Expected output:

```
======================= 118 passed, 8 skipped in 17.25s =======================
```

The 8 skipped tests are integration tests that require a Firestore emulator. To run them:

```shell
# Start Firestore emulator
gcloud emulators firestore start --host-port=localhost:8081 &

# Run tests
FIRESTORE_EMULATOR_HOST=localhost:8081 make test
```

The backend is designed to work with the Jemmie mobile app. For full end-to-end testing:
- Deploy this backend or run locally
- Use the Jemmie iOS app pointing to your backend URL
- Test voice conversation, camera features, and location sharing
The backend is deployed on Google Cloud Run:

- Base URL: `https://jemmie-backend-XXXXX-uc.a.run.app`
- WebSocket endpoint: `wss://jemmie-backend-XXXXX-uc.a.run.app/ws/{device_id}`
Run the setup script to create required GCP resources:
```shell
make setup-infra PROJECT_ID=your-project-id
```

This creates an Artifact Registry repository for Docker images, a Firestore database for session storage, and a service account with the required permissions.
Add these secrets to your GitHub repository to enable automatic deployment on push to main:
- `GCP_PROJECT_ID`: Your Google Cloud project ID
- `GCP_SERVICE_ACCOUNT_KEY`: Service account JSON key (from the setup script output)
- `GOOGLE_SEARCH_API_KEY`: Custom Search API key (optional, for web grounding)
- `GOOGLE_SEARCH_ENGINE_ID`: Programmable Search Engine ID (optional, for web grounding)
Distributed under the MIT License. See LICENSE for more information.
