1818</p >
1919
2020<h4 >
21- <a href="#deployment">View Demo</a>
21+ <a href="#deployment">Deployment Proof</a>
2222 <span> | </span>
23- <a href="#getting-started">Documentation</a>
23+ <a href="#getting-started">Getting Started</a>
2424 <span> | </span>
2525 <a href="https://github.com/oadultradeepfield/jemmie-backend/issues/">Report Bug</a>
2626</h4 >
2727</div >
2828
2929<br />
3030
31- # Table of Contents
32-
33- - [About the Project](#about-the-project)
34- - [Architecture](#architecture)
35- - [Tech Stack](#tech-stack)
36- - [Features](#features)
37- - [Environment Variables](#environment-variables)
38- - [Getting Started](#getting-started)
39- - [Prerequisites](#prerequisites)
40- - [Installation](#installation)
41- - [Running Tests](#running-tests)
42- - [Deployment](#deployment)
43- - [License](#license)
44-
4531## About the Project
4632
47- Jemmie is a real-time voice agent backend that delivers sub-second audio latency through the Gemini Live API. Built for
48- the [Gemini Live Agent Challenge](https://geminiliveagentchallenge.devpost.com), it provides a WebSocket-based
49- infrastructure for natural voice interaction with session persistence and action handling.
33+ Jemmie is a real-time voice agent backend built for
34+ the [Gemini Live Agent Challenge](https://geminiliveagentchallenge.devpost.com). It delivers sub-second audio latency
35+ through the Gemini Live API, providing WebSocket-based infrastructure for natural voice interaction with session
36+ persistence and action handling.
5037
51- The backend handles bidirectional audio streaming, visual context processing, and stateful session management with a
52- layered architecture designed for extensibility and testability.
38+ The system handles bidirectional audio streaming, visual context processing, and stateful session management through a
39+ layered architecture designed for extensibility and testability. Each layer has a single responsibility: the gateway
40+ manages WebSocket lifecycle, the engine routes frames through a state machine, pipelines transform audio and image data,
41+ the agent layer interfaces with the Gemini Live API, and the persistence layer maintains session state.
5342
54- ### Architecture
43+ ## Architecture
5544
5645``` mermaid
5746flowchart TB
47+ subgraph Client["Client"]
48+ APP[Mobile/Web App]
49+ end
50+
5851 subgraph Gateway["Gateway Layer"]
59- WS[WebSocket Handler<br/>Connection Lifecycle]
60- HB[Heartbeat Manager]
52+ WS[ConnectionGateway<br/>WebSocket Lifecycle]
6153 end
6254
63- subgraph FSM["State Machine Layer"]
64- IDLE[Idle State]
65- LISTEN[Listening State]
66- THINK[Thinking State]
67- SPEAK[Speaking State]
55+ subgraph Engine["Engine Layer"]
56+ FE[FrameEngine<br/>State Machine]
57+ CC[ConnectionContext<br/>Shared State]
6858 end
6959
7060 subgraph Pipelines["Pipeline Layer"]
71- AUDIO[Audio Pipeline<br/>16kHz Input / 24kHz Output]
61+ AUDIO[Audio Pipeline<br/>16kHz In / 24kHz Out]
7262 IMAGE[Image Pipeline<br/>JPEG Processing]
63+ ACTION[Action Pipeline<br/>Event Dispatch]
7364 end
7465
7566 subgraph Agent["Agent Layer"]
76- ADK[ADK Integration<br/>Google Agent SDK]
67+ LS[LiveSession<br/>Gemini Live API]
7768 ROUTER[Action Router<br/>SET_TIMER / SHARE_LOCATION]
7869 end
7970
8071 subgraph Persistence["Persistence Layer"]
81- SESSION[Session Manager]
72+ SM[Session Manager]
8273 FIRESTORE[(Firestore)]
8374 end
8475
85- WS --> IDLE
86- IDLE --> LISTEN
87- LISTEN --> THINK
88- THINK --> SPEAK
89- SPEAK --> IDLE
76+ APP <-->|"ws://host/ws/{device_id}"| WS
77+ WS --> FE
78+ FE --> CC
79+ CC --> AUDIO
80+ CC --> IMAGE
81+ CC --> ACTION
82+ AUDIO --> LS
83+ IMAGE --> LS
84+ ACTION --> ROUTER
85+ LS <-->|"Gemini Live API"| GEMINI[(Gemini)]
86+ FE --> SM
87+ SM <--> FIRESTORE
88+ ```
89+
90+ The architecture consists of five layers that process frames from the client through to Gemini and back:
9091
91- WS --> AUDIO
92- WS --> IMAGE
93- AUDIO --> ADK
94- IMAGE --> ADK
92+ - **Gateway Layer** (`src/gateway/`): Handles WebSocket connection acceptance, lifecycle management, and guaranteed
93+ cleanup on disconnect. It initializes the connection context and spawns the frame engine.
9594
96- ADK --> ROUTER
97- SESSION <--> FIRESTORE
98- FSM --> SESSION
99- ```
95+ - **Engine Layer** (`src/engine/`): Routes incoming frames through a state machine with five connection states: `IDLE`,
96+ `CONNECTED`, `ACTIVE`, `DRAINING`, and `CLOSED`. Frames received during non-active states are queued and processed
97+ once the connection becomes active.
98+
99+ - **Pipeline Layer** (`src/pipelines/`): Transforms data between client and Gemini formats. Audio is resampled from
100+ 16kHz client input to 24kHz Gemini input, with output transcoded back. Image frames are validated and forwarded for
101+ multimodal understanding. Actions are dispatched to handlers or routed to the client.
100102
101- The backend is organized into five layers:
103+ - **Agent Layer** (`src/agent/`): Manages the bidirectional streaming connection to Gemini Live API through the Google
104+ GenAI SDK. Action handlers execute server-side logic (e.g., SET_TIMER) or process client events (e.g.,
105+ SHARE_LOCATION).
102106
103- - **Gateway Layer**: WebSocket connection handling with heartbeat management for connection health monitoring
104- - **State Machine Layer**: Four-state FSM (Idle -> Listening -> Thinking -> Speaking) controlling conversation flow
105- - **Pipeline Layer**: Audio transcoding between client format (16kHz) and Gemini format (24kHz), plus image processing
106- - **Agent Layer**: ADK integration with Gemini Live API and action routing for client-side commands
107- - **Persistence Layer**: Session state management with 10-minute resumption window via Firestore
107+ - **Persistence Layer** (`src/session/`): Stores session state in Firestore with a device-as-identity pattern. Sessions
108+ can be resumed within a 10-minute window, allowing users to continue conversations across connection drops.
108109
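The queuing behaviour described for the Engine Layer can be sketched roughly as follows. This is a minimal illustration, not the project's `FrameEngine`: the class name `FrameQueueSketch`, its methods, and the handler callback are all hypothetical, and only the five state names come from the README text. It assumes that frames queued while the connection is not `ACTIVE` are flushed in arrival order on the transition to `ACTIVE`.

```python
from collections import deque
from enum import Enum, auto


class ConnectionState(Enum):
    """The five connection states named in the README."""
    IDLE = auto()
    CONNECTED = auto()
    ACTIVE = auto()
    DRAINING = auto()
    CLOSED = auto()


class FrameQueueSketch:
    """Hypothetical stand-in for a frame engine (not the project's class)."""

    def __init__(self, handler):
        self.state = ConnectionState.IDLE
        self._pending = deque()   # frames waiting for the ACTIVE state
        self._handler = handler   # callable invoked once per processed frame

    def submit(self, frame):
        # Process immediately when active; otherwise hold the frame.
        if self.state is ConnectionState.ACTIVE:
            self._handler(frame)
        else:
            self._pending.append(frame)

    def transition(self, new_state):
        self.state = new_state
        # Assumption: queued frames are flushed in FIFO order on activation.
        if new_state is ConnectionState.ACTIVE:
            while self._pending:
                self._handler(self._pending.popleft())
```

For example, with `engine = FrameQueueSketch(print)`, frames passed to `engine.submit(...)` are held until `engine.transition(ConnectionState.ACTIVE)` is called, after which new frames go straight to the handler.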
109- ### Tech Stack
110+ ## Tech Stack
110111
111112<details >
112113<summary >Server</summary >
@@ -128,43 +129,36 @@ The backend is organized into five layers:
128129</ul >
129130</details >
130131
131- ### Features
132+ ## Features
132133
133- - **Bidirectional Audio Streaming**: PCM audio with automatic format conversion (16kHz input, 24kHz output)
134- - **Stateful Session Management**: Device-as-identity pattern with 10-minute resumption window
135- - **Action System**: Server-to-client commands (SET_TIMER) and client-to-server events (SHARE_LOCATION)
136- - **Visual Context Support**: JPEG frame processing for multimodal understanding
137- - **Graceful Degradation**: Connection health monitoring with automatic cleanup
138-
139- ### Environment Variables
140-
141- To run this project, create a `.env` file based on `.env.example`:
142-
143- | Variable | Description | Required |
144- |-----------------------------|---------------------------------------------------------------------|------------------------------|
145- | `GOOGLE_API_KEY` | API key from [Google AI Studio](https://aistudio.google.com/apikey) | Yes (local dev) |
146- | `GOOGLE_GENAI_USE_VERTEXAI` | Set to `FALSE` for API key, `TRUE` for Vertex AI | No (defaults to Vertex AI) |
147- | `GOOGLE_CLOUD_PROJECT` | GCP project ID | Yes (Vertex AI mode) |
148- | `GOOGLE_CLOUD_LOCATION` | GCP region (must be `us-central1` for Gemini Live) | No (defaults to us-central1) |
134+ - **Bidirectional Audio Streaming**: PCM audio with automatic format conversion between client (16kHz) and Gemini (
135+ 24kHz) sample rates
136+ - **Stateful Session Management**: Device-as-identity pattern with 10-minute resumption window for conversation
137+ continuity
138+ - **Action System**: Server-to-client commands (SET_TIMER) and client-to-server events (SHARE_LOCATION) for interactive
139+ features
140+ - **Visual Context Support**: JPEG frame processing for multimodal understanding with the live session
141+ - **Graceful Degradation**: Connection state machine with frame queuing and guaranteed cleanup on disconnect
149142
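As a rough illustration of the 10-minute resumption window, the check below decides whether a device's previous session is still resumable. It is a sketch under stated assumptions: the function name `can_resume`, the `last_seen` timestamp, and the inclusive boundary at exactly 10 minutes are hypothetical and not taken from the project's session manager.

```python
from datetime import datetime, timedelta, timezone

# Window length taken from the README; everything else here is assumed.
RESUMPTION_WINDOW = timedelta(minutes=10)


def can_resume(last_seen: datetime, now: datetime) -> bool:
    """Return True if `now` falls within the window after `last_seen`.

    Assumption: the boundary is inclusive and timestamps in the future
    (negative elapsed time) are treated as not resumable.
    """
    elapsed = now - last_seen
    return timedelta(0) <= elapsed <= RESUMPTION_WINDOW


if __name__ == "__main__":
    disconnected_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    reconnect_at = disconnected_at + timedelta(minutes=4)
    print(can_resume(disconnected_at, reconnect_at))
```

Using timezone-aware timestamps on both sides avoids the `TypeError` Python raises when naive and aware datetimes are subtracted.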
150143## Getting Started
151144
152145### Prerequisites
153146
154- This project uses `uv` for package management. See the [Makefile](Makefile) for available commands.
147+ This project uses `uv` for package management. Install it from [astral.sh/uv](https://docs.astral.sh/uv/) if you don't
148+ have it already.
155149
156150### Installation
157151
158- 1. Clone the repository:
152+ Clone the repository and set up your environment:
159153
160154``` bash
161155git clone https://github.com/oadultradeepfield/jemmie-backend.git
162- cd gemini-live-agent-challenge-backend
156+ cd jemmie-backend
157+ cp .env.example .env
163158```
164159
165- 2. Set up environment variables from `.env.example`
166-
167- 3. Run the development server:
160+ Edit `.env` and add your Google API key from [Google AI Studio](https://aistudio.google.com/apikey). Then start the
161+ development server:
168162
169163``` bash
170164make dev
@@ -179,6 +173,20 @@ make check # Run linting and type checks
179173make test # Run test suite
180174```
181175
176+ ### Environment Variables
177+
178+ The `.env.example` file contains the minimal configuration for local development:
179+
180+ | Variable | Description | Default |
181+ |-----------------------------|----------------------------------|------------------------------------------|
182+ | `GOOGLE_API_KEY` | API key from Google AI Studio | Required for local dev |
183+ | `GOOGLE_GENAI_USE_VERTEXAI` | Use Vertex AI instead of API key | `TRUE` (use `FALSE` for API key) |
184+ | `GOOGLE_CLOUD_PROJECT` | GCP project ID | Required for Vertex AI mode |
185+ | `GOOGLE_CLOUD_LOCATION` | GCP region | `us-central1` (required for Gemini Live) |
186+
187+ For local development with an API key, set `GOOGLE_GENAI_USE_VERTEXAI=FALSE` and provide your `GOOGLE_API_KEY`. For
188+ production deployment on Cloud Run, the service account credentials are used automatically with Vertex AI.
189+
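The two modes in the table can be read as a small decision rule. The sketch below shows one way to validate the variables before starting the server. It is illustrative only: `resolve_genai_config` and the returned dictionary shape are hypothetical, and treating the flag as a case-insensitive `TRUE`/`FALSE` string is an assumption about how the project parses it.

```python
import os
from typing import Mapping


def resolve_genai_config(env: Mapping[str, str]) -> dict:
    """Pick API-key or Vertex AI mode from environment-style settings.

    Hypothetical helper; defaults follow the table above
    (Vertex AI on, region us-central1).
    """
    flag = env.get("GOOGLE_GENAI_USE_VERTEXAI", "TRUE").strip().upper()
    if flag == "TRUE":
        project = env.get("GOOGLE_CLOUD_PROJECT")
        if not project:
            raise ValueError("GOOGLE_CLOUD_PROJECT is required in Vertex AI mode")
        return {
            "mode": "vertexai",
            "project": project,
            "location": env.get("GOOGLE_CLOUD_LOCATION", "us-central1"),
        }
    api_key = env.get("GOOGLE_API_KEY")
    if not api_key:
        raise ValueError("GOOGLE_API_KEY is required when Vertex AI is disabled")
    return {"mode": "api_key", "api_key": api_key}


if __name__ == "__main__":
    try:
        # Report only the selected mode so no secret is printed.
        print(resolve_genai_config(os.environ)["mode"])
    except ValueError as exc:
        print(f"configuration error: {exc}")
```

Failing fast on a missing variable gives a clearer error than a rejected request at the first Gemini call.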
182190## Deployment
183191
184192### Infrastructure Setup
@@ -189,19 +197,15 @@ Run the setup script to create required GCP resources:
189197make setup-infra PROJECT_ID=your-project-id
190198```
191199
192- This creates:
193-
194- - Artifact Registry repository for Docker images
195- - Firestore database for session storage
196- - Service account with required permissions
200+ This creates an Artifact Registry repository for Docker images, a Firestore database for session storage, and a service
201+ account with the required permissions.
197202
198203### GitHub Actions Deployment
199204
200- 1. Add secrets to your GitHub repository:
201- - `GCP_PROJECT_ID`: Your Google Cloud project ID
202- - `GCP_SERVICE_ACCOUNT_KEY`: Service account JSON key
205+ Add these secrets to your GitHub repository to enable automatic deployment on push to main:
203206
204- 2. Push to the main branch to trigger automatic deployment
207+ - `GCP_PROJECT_ID`: Your Google Cloud project ID
208+ - `GCP_SERVICE_ACCOUNT_KEY`: Service account JSON key (from the setup script output)
205209
206210### GCP Deployment Proof
207211