Commit 621ac83

Revise README with accurate architecture and improved formatting
1 parent d2ef45e commit 621ac83

2 files changed

Lines changed: 97 additions & 87 deletions

.env.example

Lines changed: 9 additions & 3 deletions
@@ -1,6 +1,12 @@
1-
# Required: Get your API key from https://aistudio.google.com/apikey
1+
# Local development with API key (default mode)
2+
# Get your API key from https://aistudio.google.com/apikey
23
GOOGLE_API_KEY=your-api-key-here
4+
GOOGLE_GENAI_USE_VERTEXAI=FALSE
35

4-
# Optional: Override defaults for local development
5-
# GOOGLE_GENAI_USE_VERTEXAI=FALSE
6+
# Production mode with Vertex AI (used on Cloud Run)
7+
# GOOGLE_GENAI_USE_VERTEXAI=TRUE
8+
# GOOGLE_CLOUD_PROJECT=your-project-id
9+
# GOOGLE_CLOUD_LOCATION=us-central1
10+
11+
# Optional
612
# LOG_LEVEL=DEBUG

README.md

Lines changed: 88 additions & 84 deletions
@@ -18,95 +18,96 @@
1818
</p>
1919

2020
<h4>
21-
<a href="#deployment">View Demo</a>
21+
<a href="#deployment">Deployment Proof</a>
2222
<span> | </span>
23-
<a href="#getting-started">Documentation</a>
23+
<a href="#getting-started">Getting Started</a>
2424
<span> | </span>
2525
<a href="https://github.com/oadultradeepfield/jemmie-backend/issues/">Report Bug</a>
2626
</h4>
2727
</div>
2828

2929
<br />
3030

31-
# Table of Contents
32-
33-
- [About the Project](#about-the-project)
34-
- [Architecture](#architecture)
35-
- [Tech Stack](#tech-stack)
36-
- [Features](#features)
37-
- [Environment Variables](#environment-variables)
38-
- [Getting Started](#getting-started)
39-
- [Prerequisites](#prerequisites)
40-
- [Installation](#installation)
41-
- [Running Tests](#running-tests)
42-
- [Deployment](#deployment)
43-
- [License](#license)
44-
4531
## About the Project
4632

47-
Jemmie is a real-time voice agent backend that delivers sub-second audio latency through the Gemini Live API. Built for
48-
the [Gemini Live Agent Challenge](https://geminiliveagentchallenge.devpost.com), it provides a WebSocket-based
49-
infrastructure for natural voice interaction with session persistence and action handling.
33+
Jemmie is a real-time voice agent backend built for
34+
the [Gemini Live Agent Challenge](https://geminiliveagentchallenge.devpost.com). It delivers sub-second audio latency
35+
through the Gemini Live API, providing WebSocket-based infrastructure for natural voice interaction with session
36+
persistence and action handling.
5037

51-
The backend handles bidirectional audio streaming, visual context processing, and stateful session management with a
52-
layered architecture designed for extensibility and testability.
38+
The system handles bidirectional audio streaming, visual context processing, and stateful session management through a
39+
layered architecture designed for extensibility and testability. Each layer has a single responsibility: the gateway
40+
manages WebSocket lifecycle, the engine routes frames through a state machine, pipelines transform audio and image data,
41+
the agent layer interfaces with Gemini Live API, and the persistence layer maintains session state.
5342

54-
### Architecture
43+
## Architecture
5544

5645
```mermaid
5746
flowchart TB
47+
subgraph Client["Client"]
48+
APP[Mobile/Web App]
49+
end
50+
5851
subgraph Gateway["Gateway Layer"]
59-
WS[WebSocket Handler<br/>Connection Lifecycle]
60-
HB[Heartbeat Manager]
52+
WS[ConnectionGateway<br/>WebSocket Lifecycle]
6153
end
6254
63-
subgraph FSM["State Machine Layer"]
64-
IDLE[Idle State]
65-
LISTEN[Listening State]
66-
THINK[Thinking State]
67-
SPEAK[Speaking State]
55+
subgraph Engine["Engine Layer"]
56+
FE[FrameEngine<br/>State Machine]
57+
CC[ConnectionContext<br/>Shared State]
6858
end
6959
7060
subgraph Pipelines["Pipeline Layer"]
71-
AUDIO[Audio Pipeline<br/>16kHz Input / 24kHz Output]
61+
AUDIO[Audio Pipeline<br/>16kHz In / 24kHz Out]
7262
IMAGE[Image Pipeline<br/>JPEG Processing]
63+
ACTION[Action Pipeline<br/>Event Dispatch]
7364
end
7465
7566
subgraph Agent["Agent Layer"]
76-
ADK[ADK Integration<br/>Google Agent SDK]
67+
LS[LiveSession<br/>Gemini Live API]
7768
ROUTER[Action Router<br/>SET_TIMER / SHARE_LOCATION]
7869
end
7970
8071
subgraph Persistence["Persistence Layer"]
81-
SESSION[Session Manager]
72+
SM[Session Manager]
8273
FIRESTORE[(Firestore)]
8374
end
8475
85-
WS --> IDLE
86-
IDLE --> LISTEN
87-
LISTEN --> THINK
88-
THINK --> SPEAK
89-
SPEAK --> IDLE
76+
APP <-->|"ws://host/ws/{device_id}"| WS
77+
WS --> FE
78+
FE --> CC
79+
CC --> AUDIO
80+
CC --> IMAGE
81+
CC --> ACTION
82+
AUDIO --> LS
83+
IMAGE --> LS
84+
ACTION --> ROUTER
85+
LS <-->|"Gemini Live API"| GEMINI[(Gemini)]
86+
FE --> SM
87+
SM <--> FIRESTORE
88+
```
89+
90+
The architecture consists of five layers that process frames from the client through to Gemini and back:
9091

91-
WS --> AUDIO
92-
WS --> IMAGE
93-
AUDIO --> ADK
94-
IMAGE --> ADK
92+
- **Gateway Layer** (`src/gateway/`): Handles WebSocket connection acceptance, lifecycle management, and guaranteed
93+
cleanup on disconnect. It initializes the connection context and spawns the frame engine.
9594

96-
ADK --> ROUTER
97-
SESSION <--> FIRESTORE
98-
FSM --> SESSION
99-
```
95+
- **Engine Layer** (`src/engine/`): Routes incoming frames through a state machine with five connection states: `IDLE`,
96+
`CONNECTED`, `ACTIVE`, `DRAINING`, and `CLOSED`. Frames received during non-active states are queued and processed
97+
once the connection becomes active.
98+
99+
- **Pipeline Layer** (`src/pipelines/`): Transforms data between client and Gemini formats. Audio enters from the
100+
client at 16kHz and is returned from Gemini at 24kHz, with transcoding in between. Image frames are validated and forwarded for
101+
multimodal understanding. Actions are dispatched to handlers or routed to the client.
100102

101-
The backend is organized into five layers:
103+
- **Agent Layer** (`src/agent/`): Manages the bidirectional streaming connection to Gemini Live API through the Google
104+
GenAI SDK. Action handlers execute server-side logic (e.g., SET_TIMER) or process client events (e.g.,
105+
SHARE_LOCATION).
102106

103-
- **Gateway Layer**: WebSocket connection handling with heartbeat management for connection health monitoring
104-
- **State Machine Layer**: Four-state FSM (Idle -> Listening -> Thinking -> Speaking) controlling conversation flow
105-
- **Pipeline Layer**: Audio transcoding between client format (16kHz) and Gemini format (24kHz), plus image processing
106-
- **Agent Layer**: ADK integration with Gemini Live API and action routing for client-side commands
107-
- **Persistence Layer**: Session state management with 10-minute resumption window via Firestore
107+
- **Persistence Layer** (`src/session/`): Stores session state in Firestore with a device-as-identity pattern. Sessions
108+
can be resumed within a 10-minute window, allowing users to continue conversations across connection drops.
108109

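The queuing behaviour described for the Engine Layer can be sketched roughly as follows. This is a minimal illustration, not the project's `FrameEngine`: the class name `FrameEngineSketch`, its methods, and the choice to drop frames after `CLOSED` are assumptions made for the example. Only the five state names and the "queue until active" rule come from the description above.

```python
from collections import deque
from enum import Enum, auto


class ConnectionState(Enum):
    """The five connection states named in the Engine Layer description."""
    IDLE = auto()
    CONNECTED = auto()
    ACTIVE = auto()
    DRAINING = auto()
    CLOSED = auto()


class FrameEngineSketch:
    """Hypothetical stand-in for the frame engine (not the real class)."""

    def __init__(self) -> None:
        self.state = ConnectionState.IDLE
        self.pending = deque()   # frames waiting for the ACTIVE state
        self.processed = []      # frames that were handed on for processing

    def transition(self, new_state: ConnectionState) -> None:
        """Move to a new state and flush queued frames on becoming ACTIVE."""
        self.state = new_state
        if new_state is ConnectionState.ACTIVE:
            while self.pending:
                self.processed.append(self.pending.popleft())

    def receive(self, frame: bytes) -> None:
        """Process a frame immediately when ACTIVE, otherwise queue it."""
        if self.state is ConnectionState.CLOSED:
            return  # assumption: frames arriving after close are dropped
        if self.state is ConnectionState.ACTIVE:
            self.processed.append(frame)
        else:
            self.pending.append(frame)


engine = FrameEngineSketch()
engine.receive(b"early-frame")               # queued: state is IDLE
engine.transition(ConnectionState.ACTIVE)    # flushes the queue in order
engine.receive(b"live-frame")                # processed immediately
```

A FIFO queue keeps frames in arrival order, which matters for audio. The real engine may bound the queue or handle `DRAINING` differently.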
109-
### Tech Stack
110+
## Tech Stack
110111

111112
<details>
112113
<summary>Server</summary>
@@ -128,43 +129,36 @@ The backend is organized into five layers:
128129
</ul>
129130
</details>
130131

131-
### Features
132+
## Features
132133

133-
- **Bidirectional Audio Streaming**: PCM audio with automatic format conversion (16kHz input, 24kHz output)
134-
- **Stateful Session Management**: Device-as-identity pattern with 10-minute resumption window
135-
- **Action System**: Server-to-client commands (SET_TIMER) and client-to-server events (SHARE_LOCATION)
136-
- **Visual Context Support**: JPEG frame processing for multimodal understanding
137-
- **Graceful Degradation**: Connection health monitoring with automatic cleanup
138-
139-
### Environment Variables
140-
141-
To run this project, create a `.env` file based on `.env.example`:
142-
143-
| Variable | Description | Required |
144-
|-----------------------------|---------------------------------------------------------------------|------------------------------|
145-
| `GOOGLE_API_KEY` | API key from [Google AI Studio](https://aistudio.google.com/apikey) | Yes (local dev) |
146-
| `GOOGLE_GENAI_USE_VERTEXAI` | Set to `FALSE` for API key, `TRUE` for Vertex AI | No (defaults to Vertex AI) |
147-
| `GOOGLE_CLOUD_PROJECT` | GCP project ID | Yes (Vertex AI mode) |
148-
| `GOOGLE_CLOUD_LOCATION` | GCP region (must be `us-central1` for Gemini Live) | No (defaults to us-central1) |
134+
- **Bidirectional Audio Streaming**: PCM audio with automatic format conversion between client (16kHz) and Gemini
135+
(24kHz) sample rates
136+
- **Stateful Session Management**: Device-as-identity pattern with 10-minute resumption window for conversation
137+
continuity
138+
- **Action System**: Server-to-client commands (SET_TIMER) and client-to-server events (SHARE_LOCATION) for interactive
139+
features
140+
- **Visual Context Support**: JPEG frame processing for multimodal understanding with the live session
141+
- **Graceful Degradation**: Connection state machine with frame queuing and guaranteed cleanup on disconnect
149142

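To make the device-as-identity pattern and the 10-minute resumption window concrete, here is a small sketch. It uses an in-memory dictionary where the project uses Firestore. The names `InMemorySessionStore`, `disconnect`, and `connect` are hypothetical, and measuring the window from the last disconnect (inclusive at exactly 10 minutes) is an assumption rather than documented behaviour.

```python
from datetime import datetime, timedelta, timezone

# Window length taken from the feature list; everything else is illustrative.
RESUMPTION_WINDOW = timedelta(minutes=10)


class InMemorySessionStore:
    """Hypothetical in-memory stand-in for the Firestore-backed session store."""

    def __init__(self) -> None:
        # The device ID is the only identity key (device-as-identity).
        self._last_seen = {}

    def disconnect(self, device_id: str, now: datetime) -> None:
        """Record when the device was last connected."""
        self._last_seen[device_id] = now

    def connect(self, device_id: str, now: datetime) -> str:
        """Return "resumed" if the device reconnects inside the window, else "new"."""
        last_seen = self._last_seen.get(device_id)
        if last_seen is not None and now - last_seen <= RESUMPTION_WINDOW:
            return "resumed"
        # Expired or unknown device: discard stale state and start fresh.
        self._last_seen.pop(device_id, None)
        return "new"


store = InMemorySessionStore()
dropped_at = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
store.disconnect("device-123", dropped_at)
status = store.connect("device-123", dropped_at + timedelta(minutes=4))
```

Passing `now` explicitly instead of calling `datetime.now()` inside the store keeps the window logic easy to test.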
150143
## Getting Started
151144

152145
### Prerequisites
153146

154-
This project uses `uv` for package management. See the [Makefile](Makefile) for available commands.
147+
This project uses `uv` for package management. Install it from [astral.sh/uv](https://docs.astral.sh/uv/) if you don't
148+
have it already.
155149

156150
### Installation
157151

158-
1. Clone the repository:
152+
Clone the repository and set up your environment:
159153

160154
```bash
161155
git clone https://github.com/oadultradeepfield/jemmie-backend.git
162-
cd gemini-live-agent-challenge-backend
156+
cd jemmie-backend
157+
cp .env.example .env
163158
```
164159

165-
2. Set up environment variables from `.env.example`
166-
167-
3. Run the development server:
160+
Edit `.env` and add your Google API key from [Google AI Studio](https://aistudio.google.com/apikey). Then start the
161+
development server:
168162

169163
```bash
170164
make dev
@@ -179,6 +173,20 @@ make check # Run linting and type checks
179173
make test # Run test suite
180174
```
181175

176+
### Environment Variables
177+
178+
The `.env.example` file contains the minimal configuration for local development:
179+
180+
| Variable | Description | Default or requirement |
181+
|-----------------------------|----------------------------------|------------------------------------------|
182+
| `GOOGLE_API_KEY` | API key from Google AI Studio | Required for local dev |
183+
| `GOOGLE_GENAI_USE_VERTEXAI` | Use Vertex AI instead of API key | `TRUE` (use `FALSE` for API key) |
184+
| `GOOGLE_CLOUD_PROJECT` | GCP project ID | Required for Vertex AI mode |
185+
| `GOOGLE_CLOUD_LOCATION` | GCP region | `us-central1` (required for Gemini Live) |
186+
187+
For local development with an API key, set `GOOGLE_GENAI_USE_VERTEXAI=FALSE` and provide your `GOOGLE_API_KEY`. For
188+
production deployment on Cloud Run, the service account credentials are used automatically with Vertex AI.
189+
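As a rough illustration of how the two modes in the table relate, the sketch below resolves a configuration from a mapping of environment variables. The function `resolve_genai_mode` and the returned dictionary shape are hypothetical and are not taken from the project's code. The defaults (`TRUE` for `GOOGLE_GENAI_USE_VERTEXAI`, `us-central1` for the location) follow the table, and the string parsing of `TRUE`/`FALSE` is an assumption.

```python
from typing import Mapping


def resolve_genai_mode(env: Mapping[str, str]) -> dict:
    """Pick API-key or Vertex AI mode from environment-style settings.

    Hypothetical helper: it only mirrors the defaults listed in the table.
    """
    # Per the table, Vertex AI is the default unless explicitly set to FALSE.
    use_vertex = env.get("GOOGLE_GENAI_USE_VERTEXAI", "TRUE").strip().upper() == "TRUE"

    if use_vertex:
        project = env.get("GOOGLE_CLOUD_PROJECT")
        if not project:
            raise ValueError("GOOGLE_CLOUD_PROJECT is required in Vertex AI mode")
        return {
            "mode": "vertexai",
            "project": project,
            "location": env.get("GOOGLE_CLOUD_LOCATION", "us-central1"),
        }

    api_key = env.get("GOOGLE_API_KEY")
    if not api_key:
        raise ValueError("GOOGLE_API_KEY is required in API key mode")
    return {"mode": "api_key", "api_key": api_key}


# Local development, as in .env.example:
local = resolve_genai_mode(
    {"GOOGLE_GENAI_USE_VERTEXAI": "FALSE", "GOOGLE_API_KEY": "your-api-key-here"}
)
# In a real process you would pass os.environ instead of a literal dict.
```

Failing fast on a missing key or project ID surfaces misconfiguration at startup rather than on the first request to Gemini.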
182190
## Deployment
183191

184192
### Infrastructure Setup
@@ -189,19 +197,15 @@ Run the setup script to create required GCP resources:
189197
make setup-infra PROJECT_ID=your-project-id
190198
```
191199

192-
This creates:
193-
194-
- Artifact Registry repository for Docker images
195-
- Firestore database for session storage
196-
- Service account with required permissions
200+
This creates an Artifact Registry repository for Docker images, a Firestore database for session storage, and a service
201+
account with the required permissions.
197202

198203
### GitHub Actions Deployment
199204

200-
1. Add secrets to your GitHub repository:
201-
- `GCP_PROJECT_ID`: Your Google Cloud project ID
202-
- `GCP_SERVICE_ACCOUNT_KEY`: Service account JSON key
205+
Add these secrets to your GitHub repository to enable automatic deployment on push to main:
203206

204-
2. Push to the main branch to trigger automatic deployment
207+
- `GCP_PROJECT_ID`: Your Google Cloud project ID
208+
- `GCP_SERVICE_ACCOUNT_KEY`: Service account JSON key (from the setup script output)
205209

206210
### GCP Deployment Proof
207211
