| 1 | +# node-voice-agent |
| 2 | + |
| 3 | +Node.js demo app for Deepgram Voice Agent. |
| 4 | + |
| 5 | +## Architecture |
| 6 | + |
| 7 | +- **Backend:** Node.js (JavaScript) on port 8081 |
| 8 | +- **Frontend:** Vite + vanilla JS on port 8080 (git submodule: `voice-agent-html`) |
| 9 | +- **API type:** WebSocket — `WS /api/voice-agent` |
| 10 | +- **Deepgram API:** Agent API (`wss://agent.deepgram.com/v1/agent/converse`) |
| 11 | +- **Auth:** JWT session tokens via `/api/session` (WebSocket auth uses `access_token.<jwt>` subprotocol) |
| 12 | + |
| 13 | +## Key Files |
| 14 | + |
| 15 | +| File | Purpose | |
| 16 | +|------|---------| |
| 17 | +| `server.js` | Main backend — API endpoints and WebSocket proxy | |
| 18 | +| `deepgram.toml` | Metadata, lifecycle commands, tags | |
| 19 | +| `Makefile` | Standardized build/run targets | |
| 20 | +| `sample.env` | Environment variable template | |
| 21 | +| `frontend/main.js` | Frontend logic — UI controls, WebSocket connection, audio streaming | |
| 22 | +| `frontend/index.html` | HTML structure and UI layout | |
| 23 | +| `deploy/Dockerfile` | Production container (Caddy + backend) | |
| 24 | +| `deploy/Caddyfile` | Reverse proxy, rate limiting, static serving | |
| 25 | + |
| 26 | +## Quick Start |
| 27 | + |
| 28 | +```bash |
| 29 | +# Initialize (clone submodules + install deps) |
| 30 | +make init |
| 31 | + |
| 32 | +# Set up environment |
| 33 | +test -f .env || cp sample.env .env # then set DEEPGRAM_API_KEY |
| 34 | + |
| 35 | +# Start both servers |
| 36 | +make start |
| 37 | +# Backend: http://localhost:8081 |
| 38 | +# Frontend: http://localhost:8080 |
| 39 | +``` |
| 40 | + |
| 41 | +## Start / Stop |
| 42 | + |
| 43 | +**Start (recommended):** |
| 44 | +```bash |
| 45 | +make start |
| 46 | +``` |
| 47 | + |
| 48 | +**Start separately:** |
| 49 | +```bash |
| 50 | +# Terminal 1 — Backend |
| 51 | +node server.js |
| 52 | + |
| 53 | +# Terminal 2 — Frontend |
| 54 | +cd frontend && corepack pnpm run dev -- --port 8080 --no-open |
| 55 | +``` |
| 56 | + |
| 57 | +**Stop all:** |
| 58 | +```bash |
| 59 | +lsof -ti:8080,8081 | xargs kill -9 2>/dev/null |
| 60 | +``` |
| 61 | + |
| 62 | +**Clean rebuild:** |
| 63 | +```bash |
| 64 | +rm -rf node_modules frontend/node_modules frontend/.vite |
| 65 | +make init |
| 66 | +``` |
| 67 | + |
| 68 | +## Dependencies |
| 69 | + |
| 70 | +- **Backend:** `package.json`, installed with `corepack pnpm` (Corepack pins the package manager version) |
| 71 | +- **Frontend:** `frontend/package.json` — Vite dev server |
| 72 | +- **Submodules:** `frontend/` (voice-agent-html), `contracts/` (starter-contracts) |
| 73 | + |
| 74 | +- Install backend dependencies: `corepack pnpm install` |
| 75 | +- Install frontend dependencies: `cd frontend && corepack pnpm install` |
| 76 | + |
| 77 | +## API Endpoints |
| 78 | + |
| 79 | +| Endpoint | Method | Auth | Purpose | |
| 80 | +|----------|--------|------|---------| |
| 81 | +| `/api/session` | GET | None | Issue JWT session token | |
| 82 | +| `/api/metadata` | GET | None | Return app metadata (useCase, framework, language) | |
| 83 | +| `/api/voice-agent` | WS | JWT | Full-duplex voice conversation with an AI agent | |
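The session and WebSocket endpoints in the table combine into a two-step handshake: fetch a JWT from `/api/session`, then present it as the `access_token.<jwt>` subprotocol. A minimal sketch of that flow follows. The `token` field in the session response and both helper names are assumptions for illustration; check `server.js` for the real response shape.

```javascript
// Sketch only: the `token` response field and these helper names are assumptions.

// Build the subprotocol string carried by the WebSocket handshake: access_token.<jwt>
function authSubprotocol(jwt) {
  if (typeof jwt !== "string" || jwt.length === 0) {
    throw new Error("a non-empty JWT is required");
  }
  return `access_token.${jwt}`;
}

// Fetch a session token, then open the authenticated WebSocket.
// `fetchImpl` and `WebSocketImpl` are injectable so the flow can run outside a browser.
async function connectVoiceAgent(
  origin = "http://localhost:8081",
  fetchImpl = globalThis.fetch,
  WebSocketImpl = globalThis.WebSocket
) {
  const res = await fetchImpl(`${origin}/api/session`);
  if (!res.ok) throw new Error(`session request failed: ${res.status}`);
  const { token } = await res.json(); // assumed response shape: { token: "<jwt>" }
  // http -> ws and https -> wss
  const wsUrl = origin.replace(/^http/, "ws") + "/api/voice-agent";
  return new WebSocketImpl(wsUrl, [authSubprotocol(token)]);
}
```

In a browser, `const socket = await connectVoiceAgent();` would be enough, since `fetch` and `WebSocket` are globals there.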
| 84 | + |
| 85 | +## Customization Guide |
| 86 | + |
| 87 | +### How the Agent Works |
| 88 | +The backend is a **pure WebSocket proxy** — it forwards messages between the browser and Deepgram's Agent API. All agent configuration happens via JSON messages from the frontend. |
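The proxy idea can be sketched as two relays wired in opposite directions. This is not the code in `server.js`; it assumes socket objects with the `on("message", (data, isBinary) => ...)`, `send(data, { binary })`, `close()`, and `readyState` members that the `ws` package exposes, and the function names are hypothetical.

```javascript
// Sketch only: assumes `ws`-style socket objects; names are illustrative.
const OPEN = 1; // readyState value of an open WebSocket

// Forward every message from `source` to `target`, preserving the text/binary flag.
function relay(source, target) {
  source.on("message", (data, isBinary) => {
    // Messages that arrive before `target` is open are dropped in this sketch;
    // a real proxy would likely queue them instead.
    if (target.readyState === OPEN) {
      target.send(data, { binary: isBinary });
    }
  });
  // When one side closes, close the other.
  source.on("close", () => target.close());
}

// Wire browser <-> Deepgram in both directions.
function proxySockets(browserSocket, deepgramSocket) {
  relay(browserSocket, deepgramSocket);
  relay(deepgramSocket, browserSocket);
}
```

Because nothing in the relay inspects message contents, the `Settings` and update messages described in this guide pass through unchanged.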
| 89 | + |
| 90 | +### Agent Settings (sent from frontend) |
| 91 | +The frontend sends a `Settings` message after connecting: |
| 92 | + |
| 93 | +```json |
| 94 | +{ |
| 95 | + "type": "Settings", |
| 96 | + "audio": { |
| 97 | + "input": { "encoding": "linear16", "sample_rate": 16000 }, |
| 98 | + "output": { "encoding": "linear16", "sample_rate": 16000 } |
| 99 | + }, |
| 100 | + "agent": { |
| 101 | + "listen": { "provider": { "type": "deepgram", "model": "nova-3" } }, |
| 102 | + "speak": { "provider": { "type": "deepgram", "model": "aura-2-thalia-en" } }, |
| 103 | + "think": { |
| 104 | + "provider": { "type": "open_ai", "model": "gpt-4o-mini" }, |
| 105 | + "prompt": "You are a helpful assistant." |
| 106 | + } |
| 107 | + } |
| 108 | +} |
| 109 | +``` |
| 110 | + |
| 111 | +### Customizable Components |
| 112 | + |
| 113 | +| Component | Field | Options | Effect | |
| 114 | +|-----------|-------|---------|--------| |
| 115 | +| **Listen** (STT) | `agent.listen.provider.model` | `nova-3`, `nova-2` | Speech recognition model | |
| 116 | +| **Speak** (TTS) | `agent.speak.provider.model` | Any `aura-*` voice | Agent's voice | |
| 117 | +| **Think** (LLM) | `agent.think.provider.type` | `open_ai`, `anthropic` | LLM provider | |
| 118 | +| **Think** (LLM) | `agent.think.provider.model` | `gpt-4o-mini`, `gpt-4o`, etc. | LLM model | |
| 119 | +| **Prompt** | `agent.think.prompt` | Any system prompt | Agent personality/behavior | |
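One way to keep these fields in one place is a small builder that returns the `Settings` message with the customizable values as options. The sketch below mirrors the JSON example in this guide; `buildSettings` and its option names are hypothetical and do not come from `frontend/main.js`.

```javascript
// Sketch only: `buildSettings` and its option names are illustrative.
// Defaults mirror the Settings example shown earlier in this guide.
function buildSettings({
  listenModel = "nova-3",
  speakModel = "aura-2-thalia-en",
  thinkProvider = "open_ai",
  thinkModel = "gpt-4o-mini",
  prompt = "You are a helpful assistant.",
} = {}) {
  return {
    type: "Settings",
    audio: {
      input: { encoding: "linear16", sample_rate: 16000 },
      output: { encoding: "linear16", sample_rate: 16000 },
    },
    agent: {
      listen: { provider: { type: "deepgram", model: listenModel } },
      speak: { provider: { type: "deepgram", model: speakModel } },
      think: {
        provider: { type: thinkProvider, model: thinkModel },
        prompt,
      },
    },
  };
}
```

A caller would send the result once the socket opens, for example `socket.send(JSON.stringify(buildSettings({ thinkModel: "gpt-4o" })))`.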
| 120 | + |
| 121 | +### Live Updates (no reconnect needed) |
| 122 | +The frontend can update these settings mid-conversation: |
| 123 | +- `{ "type": "UpdateSpeak", "model": "aura-2-luna-en" }` — Change voice |
| 124 | +- `{ "type": "UpdatePrompt", "prompt": "New instructions..." }` — Change prompt |
| 125 | +- `{ "type": "InjectUserMessage", "content": "text" }` — Send text as user |
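These three messages can be wrapped in small helpers so UI handlers never build JSON by hand. The sketch copies the message shapes from the list above without verifying them against the Agent API reference; the helper names are hypothetical.

```javascript
// Sketch only: message shapes are copied from the list above; helper names are illustrative.
const updateSpeak = (model) => ({ type: "UpdateSpeak", model });
const updatePrompt = (prompt) => ({ type: "UpdatePrompt", prompt });
const injectUserMessage = (content) => ({ type: "InjectUserMessage", content });

// Serialize a control message and send it over an already-open socket.
function sendJson(socket, message) {
  socket.send(JSON.stringify(message));
}
```

For example, a voice dropdown's change handler could call `sendJson(socket, updateSpeak("aura-2-luna-en"))`.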
| 126 | + |
| 127 | +### Adding Function Calling |
| 128 | +The Agent API supports function calling. Add a `functions` array to the Settings message: |
| 129 | +```json |
| 130 | +{ |
| 131 | + "agent": { |
| 132 | + "think": { |
| 133 | + "functions": [ |
| 134 | + { |
| 135 | + "name": "get_weather", |
| 136 | + "description": "Get current weather", |
| 137 | + "parameters": { "type": "object", "properties": { "city": { "type": "string" } } } |
| 138 | + } |
| 139 | + ] |
| 140 | + } |
| 141 | + } |
| 142 | +} |
| 143 | +``` |
| 144 | +Then handle `FunctionCallRequest` messages in the frontend and respond with `FunctionCallResponse`. |
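A frontend handler for that round trip might look like the sketch below. The message shapes are assumptions, not confirmed by this README: the request is taken to be `{ type: "FunctionCallRequest", functions: [{ id, name, arguments }] }` with `arguments` as a JSON string, and the reply to be `{ type: "FunctionCallResponse", id, name, content }`. Verify both against the Agent API reference before relying on them. The `get_weather` implementation is a stub.

```javascript
// Sketch only. ASSUMED shapes (not confirmed by this README):
//   request:  { type: "FunctionCallRequest", functions: [{ id, name, arguments: "<JSON string>" }] }
//   response: { type: "FunctionCallResponse", id, name, content: "<string>" }

// Local implementations keyed by function name. `get_weather` returns canned data.
const functionHandlers = new Map([
  ["get_weather", async ({ city }) => ({ city, forecast: "unknown (stub)" })],
]);

// Run each requested function and send one response per call through `send`.
async function handleFunctionCallRequest(message, send) {
  if (message.type !== "FunctionCallRequest") return;
  for (const call of message.functions ?? []) {
    let content;
    try {
      const handler = functionHandlers.get(call.name);
      if (!handler) throw new Error(`unknown function: ${call.name}`);
      const args = JSON.parse(call.arguments || "{}");
      content = JSON.stringify(await handler(args));
    } catch (err) {
      // Report failures back to the agent instead of leaving the call unanswered.
      content = JSON.stringify({ error: err.message });
    }
    send({ type: "FunctionCallResponse", id: call.id, name: call.name, content });
  }
}
```

Wired into the socket, `send` would be something like `(msg) => socket.send(JSON.stringify(msg))`.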
| 145 | + |
| 146 | +### Frontend UI Controls |
| 147 | +The frontend provides: |
| 148 | +- Model dropdowns for listen/speak/think (pre-connection) |
| 149 | +- System prompt textarea (editable before and after connecting) |
| 150 | +- Chat input for text messages |
| 151 | +- "Update Settings" button for live changes |
| 152 | + |
| 153 | +To add new controls, edit `frontend/main.js` and include the values in the Settings/Update messages. |
| 154 | + |
| 155 | +## Frontend Changes |
| 156 | + |
| 157 | +The frontend is a git submodule from `deepgram-starters/voice-agent-html`. To modify: |
| 158 | + |
| 159 | +1. **Edit files in `frontend/`** — this is the working copy |
| 160 | +2. **Test locally** — changes reflect immediately via Vite HMR |
| 161 | +3. **Commit in the submodule:** `cd frontend && git add . && git commit -m "feat: description"` |
| 162 | +4. **Push the frontend repo:** `cd frontend && git push origin main` |
| 163 | +5. **Update the submodule ref:** `cd .. && git add frontend && git commit -m "chore(deps): update frontend submodule"` |
| 164 | + |
| 165 | +**IMPORTANT:** Always edit `frontend/` inside THIS starter directory. The standalone `voice-agent-html/` directory at the monorepo root is a separate checkout. |
| 166 | + |
| 167 | +### Adding a UI Control for a New Feature |
| 168 | +1. Add the HTML element in `frontend/index.html` (input, checkbox, dropdown, etc.) |
| 169 | +2. Read the value in `frontend/main.js` when making the API call or opening the WebSocket |
| 170 | +3. Pass it as a query parameter in the WebSocket URL |
| 171 | +4. Handle it in the backend `server.js` — read the param and pass it to the Deepgram API |
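Step 3 can be sketched with the standard `URL` API, which handles the encoding. The helper name and the `language` parameter in the usage note are hypothetical; the backend only honors parameters that `server.js` actually reads.

```javascript
// Sketch only: `buildVoiceAgentUrl` and any parameter names passed to it are illustrative.
function buildVoiceAgentUrl(origin, params = {}) {
  const url = new URL("/api/voice-agent", origin);
  for (const [key, value] of Object.entries(params)) {
    // Skip unset controls so the URL carries only values the user chose.
    if (value !== undefined && value !== null && value !== "") {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}
```

For instance, `buildVoiceAgentUrl("ws://localhost:8081", { language: "en" })` yields `ws://localhost:8081/api/voice-agent?language=en`.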
| 172 | + |
| 173 | +## Environment Variables |
| 174 | + |
| 175 | +| Variable | Required | Default | Purpose | |
| 176 | +|----------|----------|---------|---------| |
| 177 | +| `DEEPGRAM_API_KEY` | Yes | — | Deepgram API key | |
| 178 | +| `PORT` | No | `8081` | Backend server port | |
| 179 | +| `HOST` | No | `0.0.0.0` | Backend bind address | |
| 180 | +| `SESSION_SECRET` | No | — | JWT signing secret (production) | |
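The table translates into a small startup check. The sketch below is one way to read these variables with the listed defaults; `loadConfig` is a hypothetical helper, and returning `null` for a missing `SESSION_SECRET` is an assumption, since this README does not say what `server.js` does in that case.

```javascript
// Sketch only: `loadConfig` is illustrative, not the code in server.js.
function loadConfig(env) {
  const apiKey = env.DEEPGRAM_API_KEY;
  if (!apiKey) {
    throw new Error("DEEPGRAM_API_KEY is required");
  }
  const port = Number.parseInt(env.PORT || "8081", 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid PORT: ${env.PORT}`);
  }
  return {
    apiKey,
    port,
    host: env.HOST || "0.0.0.0",
    // Assumption: the caller decides what to do when no secret is configured.
    sessionSecret: env.SESSION_SECRET || null,
  };
}
```

In Node.js the call would be `loadConfig(process.env)`, so a missing key fails at startup rather than on the first request.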
| 181 | + |
| 182 | +## Conventional Commits |
| 183 | + |
| 184 | +All commits must follow the Conventional Commits format. Never include `Co-Authored-By` lines for Claude. |
| 185 | + |
| 186 | +``` |
| 187 | +feat(node-voice-agent): add diarization support |
| 188 | +fix(node-voice-agent): resolve WebSocket close handling |
| 189 | +refactor(node-voice-agent): simplify session endpoint |
| 190 | +chore(deps): update frontend submodule |
| 191 | +``` |
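Both rules can be checked mechanically, for example from a commit hook. The sketch below is illustrative and not part of this repository; the list of accepted types is an assumption based on commonly used Conventional Commits types, and the pattern is stricter than the specification about scope characters.

```javascript
// Sketch only: the accepted type list is an assumption, not defined by this README.
// Matches: type(optional-scope)!: description
const CONVENTIONAL_SUBJECT =
  /^(feat|fix|refactor|chore|docs|test|perf|build|ci|style|revert)(\([a-z0-9-]+\))?!?: \S.*$/;

// True when the first line of a commit message follows the format.
function isConventionalSubject(subject) {
  return CONVENTIONAL_SUBJECT.test(subject);
}

// True when any line of the message is a Co-Authored-By trailer naming Claude.
function hasClaudeCoAuthor(message) {
  return /^co-authored-by:.*\bclaude\b/im.test(message);
}
```

A hook could reject a commit when `isConventionalSubject` returns `false` for the subject line or `hasClaudeCoAuthor` returns `true` for the full message.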
| 192 | + |
| 193 | +## Testing |
| 194 | + |
| 195 | +```bash |
| 196 | +# Run conformance tests (requires app to be running) |
| 197 | +make test |
| 198 | + |
| 199 | +# Manual endpoint check |
| 200 | +curl -sf http://localhost:8081/api/metadata | python3 -m json.tool |
| 201 | +curl -sf http://localhost:8081/api/session | python3 -m json.tool |
| 202 | +``` |