Revise README with accurate architecture and improved formatting

oadultradeepfield · oadultradeepfield · commit 621ac83fa76d · 2026-03-08T14:58:43.000+08:00
diff --git a/.env.example b/.env.example
@@ -1,6 +1,12 @@
-# Required: Get your API key from https://aistudio.google.com/apikey
+# Local development with API key (default mode)
+# Get your API key from https://aistudio.google.com/apikey
 GOOGLE_API_KEY=your-api-key-here
+GOOGLE_GENAI_USE_VERTEXAI=FALSE
 
-# Optional: Override defaults for local development
-# GOOGLE_GENAI_USE_VERTEXAI=FALSE
+# Production mode with Vertex AI (used on Cloud Run)
+# GOOGLE_GENAI_USE_VERTEXAI=TRUE
+# GOOGLE_CLOUD_PROJECT=your-project-id
+# GOOGLE_CLOUD_LOCATION=us-central1
+
+# Optional
 # LOG_LEVEL=DEBUG
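
Review note on the `.env.example` change above: the two modes it documents can be sketched as a small resolver. This is a minimal illustration, not code from the repository. `GenAISettings` and `resolve_genai_settings` are hypothetical names, and treating an unset `GOOGLE_GENAI_USE_VERTEXAI` as Vertex AI mode is an assumption taken from the README table in this same commit.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GenAISettings:
    """Hypothetical container for the resolved credentials mode."""
    use_vertexai: bool
    api_key: Optional[str] = None
    project: Optional[str] = None
    location: Optional[str] = None


def resolve_genai_settings(env: Mapping[str, str]) -> GenAISettings:
    """Pick API-key mode or Vertex AI mode from environment-style variables."""
    # Assumption: an unset flag means Vertex AI, as the README table states.
    flag = env.get("GOOGLE_GENAI_USE_VERTEXAI", "TRUE").strip().upper()

    if flag in ("TRUE", "1"):
        project = env.get("GOOGLE_CLOUD_PROJECT")
        if not project:
            raise ValueError("GOOGLE_CLOUD_PROJECT is required in Vertex AI mode")
        # The location falls back to us-central1, matching .env.example.
        location = env.get("GOOGLE_CLOUD_LOCATION", "us-central1")
        return GenAISettings(True, project=project, location=location)

    api_key = env.get("GOOGLE_API_KEY")
    if not api_key:
        raise ValueError("GOOGLE_API_KEY is required in API key mode")
    return GenAISettings(False, api_key=api_key)
```

Passing `os.environ` to `resolve_genai_settings` at startup would surface a missing key or project ID as an early error rather than as a failed connection later.
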
diff --git a/README.md b/README.md
@@ -18,95 +18,96 @@
 </p>
 
 <h4>
-  <a href="#deployment">View Demo</a>
+  <a href="#deployment">Deployment Proof</a>
   <span> | </span>
-  <a href="#getting-started">Documentation</a>
+  <a href="#getting-started">Getting Started</a>
   <span> | </span>
   <a href="https://github.com/oadultradeepfield/jemmie-backend/issues/">Report Bug</a>
 </h4>
 </div>
 
 <br />
 
-# Table of Contents
-
-- [About the Project](#about-the-project)
-    - [Architecture](#architecture)
-    - [Tech Stack](#tech-stack)
-    - [Features](#features)
-    - [Environment Variables](#environment-variables)
-- [Getting Started](#getting-started)
-    - [Prerequisites](#prerequisites)
-    - [Installation](#installation)
-    - [Running Tests](#running-tests)
-- [Deployment](#deployment)
-- [License](#license)
-
 ## About the Project
 
-Jemmie is a real-time voice agent backend that delivers sub-second audio latency through the Gemini Live API. Built for
-the [Gemini Live Agent Challenge](https://geminiliveagentchallenge.devpost.com), it provides a WebSocket-based
-infrastructure for natural voice interaction with session persistence and action handling.
+Jemmie is a real-time voice agent backend built for
+the [Gemini Live Agent Challenge](https://geminiliveagentchallenge.devpost.com). It delivers sub-second audio latency
+through the Gemini Live API, providing WebSocket-based infrastructure for natural voice interaction with session
+persistence and action handling.
 
-The backend handles bidirectional audio streaming, visual context processing, and stateful session management with a
-layered architecture designed for extensibility and testability.
+The system handles bidirectional audio streaming, visual context processing, and stateful session management through a
+layered architecture designed for extensibility and testability. Each layer has a single responsibility: the gateway
+manages WebSocket lifecycle, the engine routes frames through a state machine, pipelines transform audio and image data,
+the agent layer interfaces with the Gemini Live API, and the persistence layer maintains session state.
 
-### Architecture
+## Architecture
 
 ```mermaid
 flowchart TB
+    subgraph Client["Client"]
+        APP[Mobile/Web App]
+    end
+
     subgraph Gateway["Gateway Layer"]
-        WS[WebSocket Handler<br/>Connection Lifecycle]
-        HB[Heartbeat Manager]
+        WS[ConnectionGateway<br/>WebSocket Lifecycle]
     end
 
-    subgraph FSM["State Machine Layer"]
-        IDLE[Idle State]
-        LISTEN[Listening State]
-        THINK[Thinking State]
-        SPEAK[Speaking State]
+    subgraph Engine["Engine Layer"]
+        FE[FrameEngine<br/>State Machine]
+        CC[ConnectionContext<br/>Shared State]
     end
 
     subgraph Pipelines["Pipeline Layer"]
-        AUDIO[Audio Pipeline<br/>16kHz Input / 24kHz Output]
+        AUDIO[Audio Pipeline<br/>16kHz In / 24kHz Out]
         IMAGE[Image Pipeline<br/>JPEG Processing]
+        ACTION[Action Pipeline<br/>Event Dispatch]
     end
 
     subgraph Agent["Agent Layer"]
-        ADK[ADK Integration<br/>Google Agent SDK]
+        LS[LiveSession<br/>Gemini Live API]
         ROUTER[Action Router<br/>SET_TIMER / SHARE_LOCATION]
     end
 
     subgraph Persistence["Persistence Layer"]
-        SESSION[Session Manager]
+        SM[Session Manager]
         FIRESTORE[(Firestore)]
     end
 
-    WS --> IDLE
-    IDLE --> LISTEN
-    LISTEN --> THINK
-    THINK --> SPEAK
-    SPEAK --> IDLE
+    APP <-->|"ws://host/ws/{device_id}"| WS
+    WS --> FE
+    FE --> CC
+    CC --> AUDIO
+    CC --> IMAGE
+    CC --> ACTION
+    AUDIO --> LS
+    IMAGE --> LS
+    ACTION --> ROUTER
+    LS <-->|"Gemini Live API"| GEMINI[(Gemini)]
+    FE --> SM
+    SM <--> FIRESTORE
+```
+
+The architecture consists of five layers that process frames from the client through to Gemini and back:
 
-    WS --> AUDIO
-    WS --> IMAGE
-    AUDIO --> ADK
-    IMAGE --> ADK
+- **Gateway Layer** (`src/gateway/`): Handles WebSocket connection acceptance, lifecycle management, and guaranteed
+  cleanup on disconnect. It initializes the connection context and spawns the frame engine.
 
-    ADK --> ROUTER
-    SESSION <--> FIRESTORE
-    FSM --> SESSION
-```
+- **Engine Layer** (`src/engine/`): Routes incoming frames through a state machine with five connection states: `IDLE`,
+  `CONNECTED`, `ACTIVE`, `DRAINING`, and `CLOSED`. Frames received during non-active states are queued and processed
+  once the connection becomes active.
+
+- **Pipeline Layer** (`src/pipelines/`): Transforms data between client and Gemini formats. Audio is resampled from
+  16kHz client input to 24kHz Gemini input, with output transcoded back. Image frames are validated and forwarded for
+  multimodal understanding. Actions are dispatched to handlers or routed to the client.
 
-The backend is organized into five layers:
+- **Agent Layer** (`src/agent/`): Manages the bidirectional streaming connection to the Gemini Live API through the Google
+  GenAI SDK. Action handlers execute server-side logic (e.g., SET_TIMER) or process client events (e.g.,
+  SHARE_LOCATION).
 
-- **Gateway Layer**: WebSocket connection handling with heartbeat management for connection health monitoring
-- **State Machine Layer**: Four-state FSM (Idle -> Listening -> Thinking -> Speaking) controlling conversation flow
-- **Pipeline Layer**: Audio transcoding between client format (16kHz) and Gemini format (24kHz), plus image processing
-- **Agent Layer**: ADK integration with Gemini Live API and action routing for client-side commands
-- **Persistence Layer**: Session state management with 10-minute resumption window via Firestore
+- **Persistence Layer** (`src/session/`): Stores session state in Firestore with a device-as-identity pattern. Sessions
+  can be resumed within a 10-minute window, allowing users to continue conversations across connection drops.
 
-### Tech Stack
+## Tech Stack
 
 <details>
 <summary>Server</summary>
@@ -128,43 +129,36 @@ The backend is organized into five layers:
 </ul>
 </details>
 
-### Features
+## Features
 
-- **Bidirectional Audio Streaming**: PCM audio with automatic format conversion (16kHz input, 24kHz output)
-- **Stateful Session Management**: Device-as-identity pattern with 10-minute resumption window
-- **Action System**: Server-to-client commands (SET_TIMER) and client-to-server events (SHARE_LOCATION)
-- **Visual Context Support**: JPEG frame processing for multimodal understanding
-- **Graceful Degradation**: Connection health monitoring with automatic cleanup
-
-### Environment Variables
-
-To run this project, create a `.env` file based on `.env.example`:
-
-| Variable                    | Description                                                         | Required                     |
-|-----------------------------|---------------------------------------------------------------------|------------------------------|
-| `GOOGLE_API_KEY`            | API key from [Google AI Studio](https://aistudio.google.com/apikey) | Yes (local dev)              |
-| `GOOGLE_GENAI_USE_VERTEXAI` | Set to `FALSE` for API key, `TRUE` for Vertex AI                    | No (defaults to Vertex AI)   |
-| `GOOGLE_CLOUD_PROJECT`      | GCP project ID                                                      | Yes (Vertex AI mode)         |
-| `GOOGLE_CLOUD_LOCATION`     | GCP region (must be `us-central1` for Gemini Live)                  | No (defaults to us-central1) |
+- **Bidirectional Audio Streaming**: PCM audio with automatic format conversion between client (16kHz) and Gemini
+  (24kHz) sample rates
+- **Stateful Session Management**: Device-as-identity pattern with 10-minute resumption window for conversation
+  continuity
+- **Action System**: Server-to-client commands (SET_TIMER) and client-to-server events (SHARE_LOCATION) for interactive
+  features
+- **Visual Context Support**: JPEG frame processing for multimodal understanding with the live session
+- **Graceful Degradation**: Connection state machine with frame queuing and guaranteed cleanup on disconnect
 
 ## Getting Started
 
 ### Prerequisites
 
-This project uses `uv` for package management. See the [Makefile](Makefile) for available commands.
+This project uses `uv` for package management. Install it from [astral.sh/uv](https://docs.astral.sh/uv/) if you don't
+have it already.
 
 ### Installation
 
-1. Clone the repository:
+Clone the repository and set up your environment:
 
 ```bash
 git clone https://github.com/oadultradeepfield/jemmie-backend.git
-cd gemini-live-agent-challenge-backend
+cd jemmie-backend
+cp .env.example .env
 ```
 
-2. Set up environment variables from `.env.example`
-
-3. Run the development server:
+Edit `.env` and add your Google API key from [Google AI Studio](https://aistudio.google.com/apikey). Then start the
+development server:
 
 ```bash
 make dev
@@ -179,6 +173,20 @@ make check    # Run linting and type checks
 make test     # Run test suite
 ```
 
+### Environment Variables
+
+The `.env.example` file contains the minimal configuration for local development:
+
+| Variable                    | Description                      | Default                                  |
+|-----------------------------|----------------------------------|------------------------------------------|
+| `GOOGLE_API_KEY`            | API key from Google AI Studio    | Required for local dev                   |
+| `GOOGLE_GENAI_USE_VERTEXAI` | Use Vertex AI instead of API key | `TRUE` (use `FALSE` for API key)         |
+| `GOOGLE_CLOUD_PROJECT`      | GCP project ID                   | Required for Vertex AI mode              |
+| `GOOGLE_CLOUD_LOCATION`     | GCP region                       | `us-central1` (required for Gemini Live) |
+
+For local development with an API key, set `GOOGLE_GENAI_USE_VERTEXAI=FALSE` and provide your `GOOGLE_API_KEY`. For
+production deployment on Cloud Run, the service account credentials are used automatically with Vertex AI.
+
 ## Deployment
 
 ### Infrastructure Setup
@@ -189,19 +197,15 @@ Run the setup script to create required GCP resources:
 make setup-infra PROJECT_ID=your-project-id
 ```
 
-This creates:
-
-- Artifact Registry repository for Docker images
-- Firestore database for session storage
-- Service account with required permissions
+This creates an Artifact Registry repository for Docker images, a Firestore database for session storage, and a service
+account with the required permissions.
 
 ### GitHub Actions Deployment
 
-1. Add secrets to your GitHub repository:
-    - `GCP_PROJECT_ID`: Your Google Cloud project ID
-    - `GCP_SERVICE_ACCOUNT_KEY`: Service account JSON key
+Add these secrets to your GitHub repository to enable automatic deployment on push to main:
 
-2. Push to the main branch to trigger automatic deployment
+- `GCP_PROJECT_ID`: Your Google Cloud project ID
+- `GCP_SERVICE_ACCOUNT_KEY`: Service account JSON key (from the setup script output)
 
 ### GCP Deployment Proof