
# Jemmie Backend

Real-time voice agent powered by Gemini Live API



## About the Project

Jemmie is a real-time voice agent backend built for the Gemini Live Agent Challenge. It delivers sub-second audio latency through the Gemini Live API, providing WebSocket-based infrastructure for natural voice interaction with session persistence and action handling.

The system handles bidirectional audio streaming, visual context processing, and stateful session management through a layered architecture designed for extensibility and testability. Each layer has a single responsibility: the gateway manages WebSocket lifecycle, the engine routes frames through a state machine, pipelines transform audio and image data, the agent layer interfaces with Gemini Live API, and the persistence layer maintains session state.

## Architecture

```mermaid
flowchart TB
    subgraph CLIENT["Client Layer"]
        direction LR
        APP["Mobile App<br/><sub>iOS / Android</sub>"]
    end

    subgraph GCP["Google Cloud Platform"]
        direction TB

        subgraph GATEWAY["Gateway Layer"]
            WS["ConnectionGateway<br/><sub>WebSocket Lifecycle</sub>"]
        end

        subgraph ENGINE["Engine Layer"]
            FE["FrameEngine<br/><sub>State Machine</sub>"]
            CC["ConnectionContext<br/><sub>Shared State</sub>"]
        end

        subgraph PIPELINE["Pipeline Layer"]
            direction LR
            AUDIO["Audio<br/><sub>16kHz to 24kHz</sub>"]
            IMAGE["Image<br/><sub>JPEG Processing</sub>"]
            ACTION["Action<br/><sub>Event Dispatch</sub>"]
        end

        subgraph AGENT["Agent Layer"]
            LS["LiveSession<br/><sub>Streaming Manager</sub>"]
            ROUTER["Tool Router<br/><sub>9 Built-in Tools</sub>"]
            SEARCH["Web Search<br/><sub>Grounding</sub>"]
        end

        subgraph PERSIST["Persistence"]
            SM["Session Manager"]
            FIRESTORE[("Firestore<br/><sub>Session State</sub>")]
        end
    end

    subgraph GEMINI["Google AI"]
        GEM["Gemini Live API<br/><sub>gemini-live-2.5-flash</sub>"]
    end

    subgraph EXTERNAL["External Services"]
        CSE["Custom Search API<br/><sub>Web Grounding</sub>"]
    end

    APP <-->|"wss:// Secure WebSocket"| WS
    WS --> FE
    FE --> CC
    CC --> AUDIO & IMAGE & ACTION
    AUDIO & IMAGE --> LS
    ACTION --> ROUTER
    ROUTER --> SEARCH
    LS <-->|"Real-time Audio<br/>Function Calls"| GEM
    SEARCH -->|"Search Results"| GEM
    SEARCH -.->|"JSON API"| CSE
    FE --> SM
    SM <--> FIRESTORE
```

## Architecture Overview

The system follows a layered architecture designed for real-time voice interaction with sub-second latency:

| Layer | Responsibility | Components |
|---|---|---|
| Gateway | WebSocket lifecycle, connection management | ConnectionGateway |
| Engine | Frame routing via state machine (IDLE → CONNECTED → ACTIVE → DRAINING → CLOSED) | FrameEngine, ConnectionContext |
| Pipeline | Data transformation between client and Gemini formats | Audio, Image, Action pipelines |
| Agent | Gemini Live API integration, tool execution | LiveSession, Tool Router, Web Search |
| Persistence | Session state with 10-minute resumption window | Session Manager, Firestore |
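The 10-minute resumption window in the Persistence row can be illustrated with a small check. This is a sketch only; the field name `last_disconnect` is hypothetical, not the actual Firestore schema:

```python
from datetime import datetime, timedelta, timezone

RESUMPTION_WINDOW = timedelta(minutes=10)

def can_resume(last_disconnect: datetime, now: datetime) -> bool:
    """True if a dropped session is still inside the resumption window."""
    return now - last_disconnect <= RESUMPTION_WINDOW

# A session that dropped 5 minutes ago can resume; one from 11 minutes ago cannot.
t0 = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(can_resume(t0, t0 + timedelta(minutes=5)))   # True
print(can_resume(t0, t0 + timedelta(minutes=11)))  # False
```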

## Available Tools

| Tool | Type | Description |
|---|---|---|
| SET_TIMER | Client-bound | Triggers countdown timer on device |
| END_CALL | Client-bound | Gracefully terminates the session |
| OPEN_URL | Client-bound | Opens URLs/maps on device |
| FETCH_LOCATION | Client-bound | Requests user's GPS location |
| SET_REMINDER | Client-bound | Schedules push notification |
| REQUEST_BINARY_INPUT | Client-bound | Yes/No via volume buttons |
| REQUEST_CAMERA_PREVIEW | Client-bound | Captures photo from camera |
| COPY_TO_CLIPBOARD | Client-bound | Copies text to device clipboard |
| WEB_SEARCH | Model-bound | Grounding with real-time web data |
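The client-bound/model-bound split above suggests a simple dispatch rule. The tool names come from the table; the routing function itself is an illustrative sketch, not the actual Tool Router implementation:

```python
# Client-bound tools are forwarded to the device as action frames;
# model-bound tools run server-side and return results to Gemini.
CLIENT_BOUND = {
    "SET_TIMER", "END_CALL", "OPEN_URL", "FETCH_LOCATION", "SET_REMINDER",
    "REQUEST_BINARY_INPUT", "REQUEST_CAMERA_PREVIEW", "COPY_TO_CLIPBOARD",
}
MODEL_BOUND = {"WEB_SEARCH"}

def route_tool(name: str) -> str:
    """Decide where a tool call is executed (sketch)."""
    if name in CLIENT_BOUND:
        return "client"   # send an action frame down the WebSocket
    if name in MODEL_BOUND:
        return "server"   # execute server-side, feed result back to the model
    raise ValueError(f"unknown tool: {name}")
```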

**Layer Details:**

- **Gateway Layer** (`src/gateway/`): Handles WebSocket connection acceptance, lifecycle management, and guaranteed cleanup on disconnect. It initializes the connection context and spawns the frame engine.
- **Engine Layer** (`src/engine/`): Routes incoming frames through a state machine with five connection states: IDLE, CONNECTED, ACTIVE, DRAINING, and CLOSED. Frames received during non-active states are queued and processed once the connection becomes active.
- **Pipeline Layer** (`src/pipelines/`): Transforms data between client and Gemini formats. Audio is resampled from 16kHz client input to 24kHz Gemini input, with output transcoded back. Image frames are validated and forwarded for multimodal understanding. Actions are dispatched to handlers or routed to the client.
- **Agent Layer** (`src/agent/`): Manages the bidirectional streaming connection to the Gemini Live API through the Google GenAI SDK. Tool handlers execute server-side logic (e.g., WEB_SEARCH for grounding) or send commands to the client (e.g., SET_TIMER).
- **Persistence Layer** (`src/session/`): Stores session state in Firestore with a device-as-identity pattern. Sessions can be resumed within a 10-minute window, allowing users to continue conversations across connection drops.
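The engine's queue-until-active behavior can be sketched as a tiny state machine. This is a hypothetical simplification of the FrameEngine, not its actual code; the state names match the ones listed above:

```python
from enum import Enum, auto
from collections import deque

class ConnState(Enum):
    IDLE = auto()
    CONNECTED = auto()
    ACTIVE = auto()
    DRAINING = auto()
    CLOSED = auto()

class FrameEngineSketch:
    """Buffers frames until the connection is ACTIVE, then flushes in order."""
    def __init__(self):
        self.state = ConnState.IDLE
        self.queue = deque()
        self.processed = []

    def receive(self, frame):
        if self.state is ConnState.ACTIVE:
            self.processed.append(frame)
        elif self.state is not ConnState.CLOSED:
            self.queue.append(frame)   # queued during non-active states

    def activate(self):
        self.state = ConnState.ACTIVE
        while self.queue:              # flush buffered frames once active
            self.processed.append(self.queue.popleft())
```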

## Tech Stack

- Server
- DevOps

## Features

- **Bidirectional Audio Streaming**: PCM audio with automatic format conversion between client (16kHz) and Gemini (24kHz) sample rates
- **Stateful Session Management**: Device-as-identity pattern with 10-minute resumption window for conversation continuity
- **Action System**: Server-to-client commands (SET_TIMER) and client-to-server events (SHARE_LOCATION) for interactive features
- **Visual Context Support**: JPEG frame processing for multimodal understanding within the live session
- **Web Search Grounding**: Real-time web data integration via Google Custom Search API for up-to-date information
- **Graceful Degradation**: Connection state machine with frame queuing and guaranteed cleanup on disconnect
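The 16kHz-to-24kHz conversion mentioned above can be sketched with a linear-interpolation resampler. This is illustrative only; the actual pipeline may use a different resampling method:

```python
def resample_pcm16(samples: list[int], src_hz: int = 16_000, dst_hz: int = 24_000) -> list[int]:
    """Linear-interpolation resampler for mono 16-bit PCM samples (sketch)."""
    if not samples:
        return []
    ratio = src_hz / dst_hz
    out_len = int(len(samples) * dst_hz / src_hz)
    out = []
    for i in range(out_len):
        pos = i * ratio                      # fractional position in the source
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(int(a + (b - a) * frac))  # interpolate between neighbors
    return out

# 16 ms of 16 kHz audio (256 samples) becomes 384 samples at 24 kHz
print(len(resample_pcm16([0] * 256)))  # 384
```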

## Getting Started

### Prerequisites

This project uses uv for package management. Install it from astral.sh/uv if you don't have it already.

### Installation

Clone the repository and set up your environment:

```sh
git clone https://github.com/oadultradeepfield/jemmie-backend.git
cd jemmie-backend
cp .env.example .env
```

Edit .env and add your Google API key from Google AI Studio. Then start the development server:

```sh
make dev
```

The WebSocket endpoint will be available at `ws://localhost:8080/ws/{device_id}`.

### Running Tests

```sh
make check    # Run linting and type checks
make test     # Run test suite
```

### Environment Variables

The .env.example file contains the minimal configuration for local development:

| Variable | Description | Default |
|---|---|---|
| `GOOGLE_API_KEY` | API key from Google AI Studio | Required for local dev |
| `GOOGLE_GENAI_USE_VERTEXAI` | Use Vertex AI instead of API key | `TRUE` (use `FALSE` for API key) |
| `GOOGLE_CLOUD_PROJECT` | GCP project ID | Required for Vertex AI mode |
| `GOOGLE_CLOUD_LOCATION` | GCP region | `us-central1` (required for Gemini Live) |
| `GOOGLE_SEARCH_API_KEY` | Custom Search API key | Optional, for web grounding |
| `GOOGLE_SEARCH_ENGINE_ID` | Programmable Search Engine ID | Optional, for web grounding |

For local development with an API key, set GOOGLE_GENAI_USE_VERTEXAI=FALSE and provide your GOOGLE_API_KEY. For production deployment on Cloud Run, the service account credentials are used automatically with Vertex AI.
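The credential-selection rule above can be sketched as a small function. The function name `genai_mode` is hypothetical; it only mirrors the documented behavior of the two environment variables:

```python
def genai_mode(env: dict[str, str]) -> str:
    """Pick the GenAI auth mode: Vertex AI by default, API key when disabled."""
    use_vertex = env.get("GOOGLE_GENAI_USE_VERTEXAI", "TRUE").upper() != "FALSE"
    if use_vertex:
        return "vertex"   # Cloud Run service account credentials
    if not env.get("GOOGLE_API_KEY"):
        raise RuntimeError("GOOGLE_API_KEY is required when Vertex AI is disabled")
    return "api-key"
```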

## Testing Instructions for Judges

This section provides a quick way to verify the backend functionality.

### Quick Start (2 minutes)

1. Clone and set up:

   ```sh
   git clone https://github.com/oadultradeepfield/jemmie-backend.git
   cd jemmie-backend
   cp .env.example .env
   ```

2. Add your Google API key to `.env`:

   ```sh
   GOOGLE_API_KEY=your-api-key-from-aistudio
   GOOGLE_GENAI_USE_VERTEXAI=FALSE
   ```

3. Run the server:

   ```sh
   make dev
   ```

4. Verify the server is running:

   ```sh
   curl http://localhost:8080/health
   # Expected: {"status":"healthy"}
   ```

### Verify WebSocket Endpoint

The backend exposes a WebSocket endpoint at `ws://localhost:8080/ws/{device_id}`. You can test it with wscat:

```sh
# Install wscat if needed
npm install -g wscat

# Connect to the WebSocket
wscat -c ws://localhost:8080/ws/test-device

# Send a text message (triggers Gemini response)
{"type":"TEXT","payload":{"text":"Hello, what can you do?"}}

# Expected: Audio and text responses from the AI
```
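The frame typed into wscat above can also be built programmatically. This sketch infers the frame schema (`type`/`payload`) from that one example; other frame types may differ:

```python
import json

def text_frame(text: str) -> str:
    """Serialize a TEXT frame matching the wscat example (schema inferred)."""
    return json.dumps({"type": "TEXT", "payload": {"text": text}})

# Send this string over any WebSocket client connected to /ws/{device_id}
print(text_frame("Hello, what can you do?"))
```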

### Test Web Search (Grounding)

To test the web search feature, you need Google Custom Search API credentials:

1. Enable Custom Search API in Google Cloud Console
2. Create an API key with Custom Search API access
3. Create a Programmable Search Engine at programmablesearchengine.google.com
   - Set it to "Search the entire web"
   - Copy the Search Engine ID
4. Add to `.env`:

   ```sh
   GOOGLE_SEARCH_API_KEY=your-api-key
   GOOGLE_SEARCH_ENGINE_ID=your-engine-id
   ```

5. Restart the server and ask a time-sensitive question:

   ```json
   {"type":"TEXT","payload":{"text":"What's the latest news about AI today?"}}
   ```
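Under the hood, a Custom Search JSON API request uses the `key`, `cx`, and `q` query parameters. The sketch below only builds the request URL; how the backend formats the results for Gemini is not shown here:

```python
from urllib.parse import urlencode

def search_url(api_key: str, engine_id: str, query: str) -> str:
    """Build a Custom Search JSON API request URL (key/cx/q are documented params)."""
    params = urlencode({"key": api_key, "cx": engine_id, "q": query, "num": 5})
    return f"https://www.googleapis.com/customsearch/v1?{params}"

print(search_url("YOUR_KEY", "YOUR_ENGINE_ID", "latest AI news"))
```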

### Run the Test Suite

```sh
make check    # Linting + type checking
make test     # Run all tests
```

Expected output:

```
======================= 118 passed, 8 skipped in 17.25s =======================
```

The 8 skipped tests require a Firestore emulator for integration tests. To run them:

```sh
# Start Firestore emulator
gcloud emulators firestore start --host-port=localhost:8081 &

# Run tests
FIRESTORE_EMULATOR_HOST=localhost:8081 make test
```

### Test with Frontend

The backend is designed to work with the Jemmie mobile app. For full end-to-end testing:

  1. Deploy this backend or run locally
  2. Use the Jemmie iOS app pointing to your backend URL
  3. Test voice conversation, camera features, and location sharing

### Deployed Instance

The backend is deployed on Google Cloud Run:

https://jemmie-backend-XXXXX-uc.a.run.app

WebSocket endpoint: wss://jemmie-backend-XXXXX-uc.a.run.app/ws/{device_id}

## Deployment

### Infrastructure Setup

Run the setup script to create required GCP resources:

```sh
make setup-infra PROJECT_ID=your-project-id
```

This creates an Artifact Registry repository for Docker images, a Firestore database for session storage, and a service account with the required permissions.

### GitHub Actions Deployment

Add these secrets to your GitHub repository to enable automatic deployment on push to main:

- `GCP_PROJECT_ID`: Your Google Cloud project ID
- `GCP_SERVICE_ACCOUNT_KEY`: Service account JSON key (from the setup script output)
- `GOOGLE_SEARCH_API_KEY`: Custom Search API key (optional, for web grounding)
- `GOOGLE_SEARCH_ENGINE_ID`: Programmable Search Engine ID (optional, for web grounding)

### GCP Deployment Proof

Proof of deployment

## License

Distributed under the MIT License. See LICENSE for more information.
