
# Jemmie Backend

Real-time voice agent powered by Gemini Live API



## About the Project

Jemmie is a real-time voice agent backend built for the Gemini Live Agent Challenge. It delivers sub-second audio latency through the Gemini Live API, providing WebSocket-based infrastructure for natural voice interaction with session persistence and action handling.

The system handles bidirectional audio streaming, visual context processing, and stateful session management through a layered architecture designed for extensibility and testability. Each layer has a single responsibility: the gateway manages WebSocket lifecycle, the engine routes frames through a state machine, pipelines transform audio and image data, the agent layer interfaces with Gemini Live API, and the persistence layer maintains session state.

## Architecture

```mermaid
flowchart TB
    subgraph CLIENT["Client Layer"]
        direction LR
        APP["Mobile App<br/><sub>iOS / Android</sub>"]
    end

    subgraph GCP["Google Cloud Platform"]
        direction TB

        subgraph GATEWAY["Gateway Layer"]
            WS["ConnectionGateway<br/><sub>WebSocket Lifecycle</sub>"]
        end

        subgraph ENGINE["Engine Layer"]
            FE["FrameEngine<br/><sub>State Machine</sub>"]
            CC["ConnectionContext<br/><sub>Shared State</sub>"]
        end

        subgraph PIPELINE["Pipeline Layer"]
            direction LR
            AUDIO["Audio<br/><sub>16kHz to 24kHz</sub>"]
            IMAGE["Image<br/><sub>JPEG Processing</sub>"]
            ACTION["Action<br/><sub>Event Dispatch</sub>"]
        end

        subgraph AGENT["Agent Layer"]
            LS["LiveSession<br/><sub>Streaming Manager</sub>"]
            ROUTER["Tool Router<br/><sub>9 Built-in Tools</sub>"]
            SEARCH["Web Search<br/><sub>Grounding</sub>"]
        end

        subgraph PERSIST["Persistence"]
            SM["Session Manager"]
            FIRESTORE[("Firestore<br/><sub>Session State</sub>")]
        end
    end

    subgraph GEMINI["Google AI"]
        GEM["Gemini Live API<br/><sub>gemini-live-2.5-flash</sub>"]
    end

    subgraph EXTERNAL["External Services"]
        CSE["Custom Search API<br/><sub>Web Grounding</sub>"]
    end

    APP <-->|"wss:// Secure WebSocket"| WS
    WS --> FE
    FE --> CC
    CC --> AUDIO & IMAGE & ACTION
    AUDIO & IMAGE --> LS
    ACTION --> ROUTER
    ROUTER --> SEARCH
    LS <-->|"Real-time Audio<br/>Function Calls"| GEM
    SEARCH -->|"Search Results"| GEM
    SEARCH -.->|"JSON API"| CSE
    FE --> SM
    SM <--> FIRESTORE
```

## Architecture Overview

The system follows a layered architecture designed for real-time voice interaction with sub-second latency:

| Layer | Responsibility | Components |
|---|---|---|
| Gateway | WebSocket lifecycle, connection management | ConnectionGateway |
| Engine | Frame routing via state machine (IDLE → CONNECTED → ACTIVE → DRAINING → CLOSED) | FrameEngine, ConnectionContext |
| Pipeline | Data transformation between client and Gemini formats | Audio, Image, Action pipelines |
| Agent | Gemini Live API integration, tool execution | LiveSession, Tool Router, Web Search |
| Persistence | Session state with 10-minute resumption window | Session Manager, Firestore |
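The 10-minute resumption window in the Persistence row can be illustrated with a small check. This is a sketch only; the field name `last_disconnect` is hypothetical, not the actual Firestore schema:

```python
from datetime import datetime, timedelta, timezone

RESUMPTION_WINDOW = timedelta(minutes=10)

def can_resume(last_disconnect: datetime, now: datetime) -> bool:
    """True if a dropped session is still inside the resumption window."""
    return now - last_disconnect <= RESUMPTION_WINDOW

# A session that dropped 5 minutes ago can resume; one from 11 minutes ago cannot.
t0 = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(can_resume(t0, t0 + timedelta(minutes=5)))   # True
print(can_resume(t0, t0 + timedelta(minutes=11)))  # False
```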

## Available Tools

| Tool | Type | Description |
|---|---|---|
| SET_TIMER | Client-bound | Triggers countdown timer on device |
| END_CALL | Client-bound | Gracefully terminates the session |
| OPEN_URL | Client-bound | Opens URLs/maps on device |
| FETCH_LOCATION | Client-bound | Requests user's GPS location |
| SET_REMINDER | Client-bound | Schedules push notification |
| REQUEST_BINARY_INPUT | Client-bound | Yes/No via volume buttons |
| REQUEST_CAMERA_PREVIEW | Client-bound | Captures photo from camera |
| COPY_TO_CLIPBOARD | Client-bound | Copies text to device clipboard |
| WEB_SEARCH | Model-bound | Grounding with real-time web data |
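The client-bound/model-bound split above suggests a simple dispatch rule. The tool names come from the table; the routing function itself is an illustrative sketch, not the actual Tool Router implementation:

```python
# Client-bound tools are forwarded to the device as action frames;
# model-bound tools run server-side and return results to Gemini.
CLIENT_BOUND = {
    "SET_TIMER", "END_CALL", "OPEN_URL", "FETCH_LOCATION", "SET_REMINDER",
    "REQUEST_BINARY_INPUT", "REQUEST_CAMERA_PREVIEW", "COPY_TO_CLIPBOARD",
}
MODEL_BOUND = {"WEB_SEARCH"}

def route_tool(name: str) -> str:
    """Decide where a tool call is executed (sketch)."""
    if name in CLIENT_BOUND:
        return "client"   # send an action frame down the WebSocket
    if name in MODEL_BOUND:
        return "server"   # execute server-side, feed result back to the model
    raise ValueError(f"unknown tool: {name}")
```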

**Layer Details:**

- **Gateway Layer** (`src/gateway/`): Handles WebSocket connection acceptance, lifecycle management, and guaranteed cleanup on disconnect. It initializes the connection context and spawns the frame engine.
- **Engine Layer** (`src/engine/`): Routes incoming frames through a state machine with five connection states: IDLE, CONNECTED, ACTIVE, DRAINING, and CLOSED. Frames received during non-active states are queued and processed once the connection becomes active.
- **Pipeline Layer** (`src/pipelines/`): Transforms data between client and Gemini formats. Audio is resampled from 16kHz client input to 24kHz Gemini input, with output transcoded back. Image frames are validated and forwarded for multimodal understanding. Actions are dispatched to handlers or routed to the client.
- **Agent Layer** (`src/agent/`): Manages the bidirectional streaming connection to the Gemini Live API through the Google GenAI SDK. Tool handlers execute server-side logic (e.g., WEB_SEARCH for grounding) or send commands to the client (e.g., SET_TIMER).
- **Persistence Layer** (`src/session/`): Stores session state in Firestore with a device-as-identity pattern. Sessions can be resumed within a 10-minute window, allowing users to continue conversations across connection drops.
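The engine's queue-until-active behavior can be sketched as a tiny state machine. This is a hypothetical simplification of the FrameEngine, not its actual code; the state names match the ones listed above:

```python
from enum import Enum, auto
from collections import deque

class ConnState(Enum):
    IDLE = auto()
    CONNECTED = auto()
    ACTIVE = auto()
    DRAINING = auto()
    CLOSED = auto()

class FrameEngineSketch:
    """Buffers frames until the connection is ACTIVE, then flushes in order."""
    def __init__(self):
        self.state = ConnState.IDLE
        self.queue = deque()
        self.processed = []

    def receive(self, frame):
        if self.state is ConnState.ACTIVE:
            self.processed.append(frame)
        elif self.state is not ConnState.CLOSED:
            self.queue.append(frame)   # queued during non-active states

    def activate(self):
        self.state = ConnState.ACTIVE
        while self.queue:              # flush buffered frames once active
            self.processed.append(self.queue.popleft())
```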

## Tech Stack

- Server
- DevOps

## Features

- **Bidirectional Audio Streaming**: PCM audio with automatic format conversion between client (16kHz) and Gemini (24kHz) sample rates
- **Stateful Session Management**: Device-as-identity pattern with 10-minute resumption window for conversation continuity
- **Action System**: Server-to-client commands (SET_TIMER) and client-to-server events (SHARE_LOCATION) for interactive features
- **Visual Context Support**: JPEG frame processing for multimodal understanding within the live session
- **Web Search Grounding**: Real-time web data integration via Google Custom Search API for up-to-date information
- **Graceful Degradation**: Connection state machine with frame queuing and guaranteed cleanup on disconnect
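The 16kHz-to-24kHz conversion mentioned above can be sketched with a linear-interpolation resampler. This is illustrative only; the actual pipeline may use a different resampling method:

```python
def resample_pcm16(samples: list[int], src_hz: int = 16_000, dst_hz: int = 24_000) -> list[int]:
    """Linear-interpolation resampler for mono 16-bit PCM samples (sketch)."""
    if not samples:
        return []
    ratio = src_hz / dst_hz
    out_len = int(len(samples) * dst_hz / src_hz)
    out = []
    for i in range(out_len):
        pos = i * ratio                      # fractional position in the source
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(int(a + (b - a) * frac))  # interpolate between neighbors
    return out

# 16 ms of 16 kHz audio (256 samples) becomes 384 samples at 24 kHz
print(len(resample_pcm16([0] * 256)))  # 384
```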

## Getting Started

### Prerequisites

This project uses uv for package management. Install it from astral.sh/uv if you don't have it already.

### Installation

Clone the repository and set up your environment:

```sh
git clone https://github.com/oadultradeepfield/jemmie-backend.git
cd jemmie-backend
cp .env.example .env
```

Edit .env and add your Google API key from Google AI Studio. Then start the development server:

```sh
make dev
```

The WebSocket endpoint will be available at `ws://localhost:8080/ws/{device_id}`.

### Running Tests

```sh
make check    # Run linting and type checks
make test     # Run test suite
```

### Environment Variables

The .env.example file contains the minimal configuration for local development:

| Variable | Description | Default |
|---|---|---|
| `GOOGLE_API_KEY` | API key from Google AI Studio | Required for local dev |
| `GOOGLE_GENAI_USE_VERTEXAI` | Use Vertex AI instead of API key | `TRUE` (use `FALSE` for API key) |
| `GOOGLE_CLOUD_PROJECT` | GCP project ID | Required for Vertex AI mode |
| `GOOGLE_CLOUD_LOCATION` | GCP region | `us-central1` (required for Gemini Live) |
| `GOOGLE_SEARCH_API_KEY` | Custom Search API key | Optional, for web grounding |
| `GOOGLE_SEARCH_ENGINE_ID` | Programmable Search Engine ID | Optional, for web grounding |

For local development with an API key, set GOOGLE_GENAI_USE_VERTEXAI=FALSE and provide your GOOGLE_API_KEY. For production deployment on Cloud Run, the service account credentials are used automatically with Vertex AI.
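The credential-selection rule above can be sketched as a small function. The function name `genai_mode` is hypothetical; it only mirrors the documented behavior of the two environment variables:

```python
def genai_mode(env: dict[str, str]) -> str:
    """Pick the GenAI auth mode: Vertex AI by default, API key when disabled."""
    use_vertex = env.get("GOOGLE_GENAI_USE_VERTEXAI", "TRUE").upper() != "FALSE"
    if use_vertex:
        return "vertex"   # Cloud Run service account credentials
    if not env.get("GOOGLE_API_KEY"):
        raise RuntimeError("GOOGLE_API_KEY is required when Vertex AI is disabled")
    return "api-key"
```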

## Testing Instructions for Judges

This section provides a quick way to verify the backend functionality.

### Quick Start (2 minutes)

1. Clone and set up:

   ```sh
   git clone https://github.com/oadultradeepfield/jemmie-backend.git
   cd jemmie-backend
   cp .env.example .env
   ```

2. Add your Google API key to `.env`:

   ```sh
   GOOGLE_API_KEY=your-api-key-from-aistudio
   GOOGLE_GENAI_USE_VERTEXAI=FALSE
   ```

3. Run the server:

   ```sh
   make dev
   ```

4. Verify the server is running:

   ```sh
   curl http://localhost:8080/health
   # Expected: {"status":"healthy"}
   ```

### Verify WebSocket Endpoint

The backend exposes a WebSocket endpoint at `ws://localhost:8080/ws/{device_id}`. You can test it with wscat:

```sh
# Install wscat if needed
npm install -g wscat

# Connect to the WebSocket
wscat -c ws://localhost:8080/ws/test-device

# Send a text message (triggers Gemini response)
{"type":"TEXT","payload":{"text":"Hello, what can you do?"}}

# Expected: Audio and text responses from the AI
```
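The frame typed into wscat above can also be built programmatically. This sketch infers the frame schema (`type`/`payload`) from that one example; other frame types may differ:

```python
import json

def text_frame(text: str) -> str:
    """Serialize a TEXT frame matching the wscat example (schema inferred)."""
    return json.dumps({"type": "TEXT", "payload": {"text": text}})

# Send this string over any WebSocket client connected to /ws/{device_id}
print(text_frame("Hello, what can you do?"))
```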

### Test Web Search (Grounding)

To test the web search feature, you need Google Custom Search API credentials:

1. Enable Custom Search API in Google Cloud Console
2. Create an API key with Custom Search API access
3. Create a Programmable Search Engine at programmablesearchengine.google.com
   - Set it to "Search the entire web"
   - Copy the Search Engine ID
4. Add to `.env`:

   ```sh
   GOOGLE_SEARCH_API_KEY=your-api-key
   GOOGLE_SEARCH_ENGINE_ID=your-engine-id
   ```

5. Restart the server and ask a time-sensitive question:

   ```json
   {"type":"TEXT","payload":{"text":"What's the latest news about AI today?"}}
   ```
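Under the hood, a Custom Search JSON API request uses the `key`, `cx`, and `q` query parameters. The sketch below only builds the request URL; how the backend formats the results for Gemini is not shown here:

```python
from urllib.parse import urlencode

def search_url(api_key: str, engine_id: str, query: str) -> str:
    """Build a Custom Search JSON API request URL (key/cx/q are documented params)."""
    params = urlencode({"key": api_key, "cx": engine_id, "q": query, "num": 5})
    return f"https://www.googleapis.com/customsearch/v1?{params}"

print(search_url("YOUR_KEY", "YOUR_ENGINE_ID", "latest AI news"))
```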

### Run the Test Suite

```sh
make check    # Linting + type checking
make test     # Run all tests
```

Expected output:

```
======================= 118 passed, 8 skipped in 17.25s =======================
```

The 8 skipped tests require a Firestore emulator for integration tests. To run them:

```sh
# Start Firestore emulator
gcloud emulators firestore start --host-port=localhost:8081 &

# Run tests
FIRESTORE_EMULATOR_HOST=localhost:8081 make test
```

### Test with Frontend

The backend is designed to work with the Jemmie mobile app. For full end-to-end testing:

  1. Deploy this backend or run locally
  2. Use the Jemmie iOS app pointing to your backend URL
  3. Test voice conversation, camera features, and location sharing

### Deployed Instance

The backend is deployed on Google Cloud Run:

https://jemmie-backend-XXXXX-uc.a.run.app

WebSocket endpoint: wss://jemmie-backend-XXXXX-uc.a.run.app/ws/{device_id}

## Deployment

### Infrastructure Setup

Run the setup script to create required GCP resources:

```sh
make setup-infra PROJECT_ID=your-project-id
```

This creates an Artifact Registry repository for Docker images, a Firestore database for session storage, and a service account with the required permissions.

### GitHub Actions Deployment

Add these secrets to your GitHub repository to enable automatic deployment on push to main:

- `GCP_PROJECT_ID`: Your Google Cloud project ID
- `GCP_SERVICE_ACCOUNT_KEY`: Service account JSON key (from the setup script output)
- `GOOGLE_SEARCH_API_KEY`: Custom Search API key (optional, for web grounding)
- `GOOGLE_SEARCH_ENGINE_ID`: Programmable Search Engine ID (optional, for web grounding)

### GCP Deployment Proof

Proof of deployment

## License

Distributed under the MIT License. See LICENSE for more information.
