Switch voice agent from custom model key to Claude Haiku 4.5

Replace the non-existent anthropic/claude-sonnet-4-5-voice model key
with anthropic/claude-haiku-4-5, a real Anthropic model that provides
~2-4x faster time-to-first-token and 3x lower cost while maintaining
equivalent quality for short conversational voice replies. Remove the
maxTokens workaround as Haiku's natural brevity combined with the
SOUL.md instructions is sufficient for voice response length control.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| openWakeWord Embedding | ~1.3 MB | Wake Word Feature Extraction |
+| openWakeWord Melspectrogram | ~1.1 MB | Wake Word Audio Processing |

**Wake word model note:** The openWakeWord `.onnx` files use Git LFS. The download script fetches them from GitHub Releases to avoid getting HTML redirect pages instead of model binaries. If models seem corrupt:
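If a downloaded model seems corrupt, a quick first check is whether the file is actually binary. The sketch below is a hypothetical helper (not part of the repo's download script) that classifies the first bytes of a file: Git LFS pointer files begin with a fixed `version https://git-lfs.github.com/spec/v1` header, and a failed release download often yields an HTML page instead of protobuf bytes.

```typescript
// Classify the first bytes of a downloaded .onnx file.
// "lfs-pointer": the Git LFS text stub was fetched instead of the binary.
// "html": a redirect/error page was saved instead of the model.
// "ok": looks like binary data (no further validation attempted).
function classifyModelBytes(head: Buffer | string): "ok" | "lfs-pointer" | "html" {
  const text = head.toString().slice(0, 200);
  if (text.startsWith("version https://git-lfs.github.com/spec/v1")) {
    return "lfs-pointer";
  }
  const t = text.trimStart().toLowerCase();
  if (t.startsWith("<!doctype") || t.startsWith("<html")) {
    return "html";
  }
  return "ok";
}
```

In practice you would pass the first few hundred bytes read from the file; anything other than `"ok"` means the download should be retried from the GitHub Release asset URL.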
@@ -399,14 +399,14 @@ Say **"Hey Jarvis"** into the microphone. You should see transcription in the vo
| Client → Server | `ping` | `{ "type": "ping", "timestamp": 1234567890 }` |
+| Server → Client | `pong` | `{ "type": "pong", "timestamp": 1234567890 }` |

---
@@ -456,12 +456,14 @@ You are OpenClawPi — a conversational voice assistant running on a Raspberry P
Your responses are spoken aloud via text-to-speech. Brevity is essential.

CRITICAL — Response length limits (these are hard rules):
+
- Simple questions (weather, time, facts): 1–2 sentences maximum.
- Explanations or summaries: 3–4 sentences maximum.
- Complex topics: 5–6 sentences maximum, then stop and offer to continue.
- NEVER exceed 6 sentences in a single response.

Speech rules:
+
- Speak naturally using flowing sentences, not bullet points.
- NEVER use markdown (no bold, italic, headers, code blocks, bullet lists, links).
- Use contractions naturally ("I'm", "you're", "that's", "it's").
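The sentence limits above are prompt-level (soft) rules: nothing enforces them mechanically. As an illustration only, a hypothetical post-processing guard could count sentences and flag replies that ignore the prompt. This uses naive punctuation splitting; real sentence segmentation is harder (abbreviations, decimals, etc.).

```typescript
// Hypothetical guard: flag replies that exceed the SOUL.md limit of
// six sentences. Splits on runs of terminal punctuation followed by
// whitespace or end of string, then drops empty fragments.
function exceedsSentenceLimit(reply: string, maxSentences: number = 6): boolean {
  const sentences = reply
    .split(/[.!?]+(?:\s+|$)/)
    .filter((s) => s.trim().length > 0);
  return sentences.length > maxSentences;
}
```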
@@ -479,55 +481,43 @@ Add the following to `~/.openclaw/openclaw.json`:

```json5
{
-  "agents": {
-    "defaults": {
-      "models": {
-        // Voice-only model key — only used by voice-agent below.
-        // Other agents (Telegram, WhatsApp, etc.) are NOT affected.
-        "anthropic/claude-sonnet-4-5-voice": {
-          "params": {
-            "maxTokens": 512
-          }
-        }
-      }
-    },
-    "list": [
+  agents: {
+    list: [
      // Default agent — used by Telegram, WhatsApp, Discord, etc.
-      // No model override → uses the global default (8192 maxTokens).
+      // No model override → uses the global default (claude-sonnet-4-5).
      {
-        "id": "main",
-        "default": true
+        id: "main",
+        default: true,
      },
      // Voice-only agent — ONLY used when channel matches "voice-assistant".
-      // Gets the 512-token cap via the dedicated model key above.
+      // Uses Haiku for faster time-to-first-token (~2-4x faster than Sonnet).
      {
-        "id": "voice-agent",
-        "workspace": "~/.openclaw/workspaces/voice-agent",
-        "model": "anthropic/claude-sonnet-4-5-voice"
-      }
-    ]
+        id: "voice-agent",
+        workspace: "~/.openclaw/workspaces/voice-agent",
+        model: "anthropic/claude-haiku-4-5",
+      },
+    ],
  },
-  "bindings": [
+  bindings: [
    // This binding scopes voice-agent to the voice-assistant channel ONLY.
    // All other channels fall through to the default "main" agent.
    {
-      "agentId": "voice-agent",
-      "match": {
-        "channel": "voice-assistant"
-      }
-    }
-  ]
+      agentId: "voice-agent",
+      match: {
+        channel: "voice-assistant",
+      },
+    },
+  ],
}
```

-**Scoping:** The 512-token limit ONLY applies to the voice channel. Here's why:
+**Why Haiku for voice?** Claude Haiku 4.5 is ~2-4x faster than Sonnet 4.5 in time-to-first-token (~0.5s vs ~1.2-2.0s) and ~3x cheaper, while producing equivalent quality for short conversational responses. For a voice assistant where latency is critical, this makes a significant difference in how responsive it feels.

-1. The model key `anthropic/claude-sonnet-4-5-voice` (with `maxTokens: 512`) is just an entry in the model catalog — it does nothing unless an agent explicitly references it.
-2. Only `voice-agent` sets `"model": "anthropic/claude-sonnet-4-5-voice"`.
-3. Only the `voice-assistant` channel is bound to `voice-agent` (via the binding).
-4. The default `main` agent (used by Telegram, WhatsApp, Discord, etc.) has no model override, so it uses the global default model with the standard 8192 maxTokens.
+**Scoping:** The Haiku model ONLY applies to the voice channel:

-> **Tip:** If 512 tokens feels too restrictive (responses getting cut off), bump it to `768` or `1024`. For most spoken responses, 512 tokens (~3-5 sentences) is the sweet spot.
+1. Only `voice-agent` sets `"model": "anthropic/claude-haiku-4-5"`.
+2. Only the `voice-assistant` channel is bound to `voice-agent` (via the binding).
+3. The default `main` agent (used by Telegram, WhatsApp, Discord, etc.) has no model override, so it uses the global default (`anthropic/claude-sonnet-4-5`).
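The channel-to-agent routing that this config relies on can be sketched roughly as follows. This is an illustrative model, not OpenClaw's actual internals; the type and function names here are hypothetical. A channel resolves to a bound agent when a binding matches, otherwise it falls through to the default agent.

```typescript
// Illustrative types mirroring the config shape above (not real internals).
type Agent = { id: string; default?: boolean; model?: string };
type Binding = { agentId: string; match: { channel: string } };

// Resolve which agent handles a message from a given channel:
// first check bindings, then fall back to the default agent.
function resolveAgent(channel: string, agents: Agent[], bindings: Binding[]): Agent {
  const bound = bindings.find((b) => b.match.channel === channel);
  if (bound) {
    const agent = agents.find((a) => a.id === bound.agentId);
    if (agent) return agent;
  }
  const fallback = agents.find((a) => a.default);
  if (!fallback) throw new Error("no default agent configured");
  return fallback;
}
```

With the config above, `"voice-assistant"` resolves to `voice-agent` (Haiku), while any other channel resolves to `main`.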
### 9c. Restart and Test
@@ -539,16 +529,22 @@ Say **"Hey Jarvis, tell me about the weather"** — the response should sound na
| Voice mic | `voice-agent` | `claude-haiku-4-5` | Concise (SOUL.md) | Yes |
+| Telegram | Default agent | `claude-sonnet-4-5` | Normal rich text | N/A |
+| Telegram → Voice broadcast | Default agent | `claude-sonnet-4-5` | Normal rich text | Yes |
+
+Voice response conciseness is controlled by the **SOUL.md** system prompt, which instructs the LLM to keep responses to 1–6 sentences. Using Haiku instead of Sonnet gives a ~2-4x improvement in time-to-first-token, making the voice assistant feel significantly more responsive.

-Voice response conciseness is controlled by two independent layers:
+#### Model comparison for voice

-1. **SOUL.md** (soft control) — instructs the LLM to keep responses to 1–6 sentences. This is the primary lever.
-2. **maxTokens** (hard ceiling) — caps the voice model at 512 tokens, preventing runaway generation even if the LLM ignores the system prompt.

The TTS sanitizer (`tts-sanitize.ts`) acts as a safety net on **all** text reaching the speaker, regardless of which agent generated it. Even if the voice agent's SOUL.md instructions are followed perfectly, the sanitizer ensures no markdown artifacts slip through.
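For illustration, a minimal markdown-stripping pass in the spirit of what a TTS sanitizer does might look like the sketch below. The real rules in `tts-sanitize.ts` may differ; this only shows the general technique of rewriting markdown into plain speakable text.

```typescript
// Strip common markdown artifacts so the TTS engine never reads
// formatting characters aloud. Order matters: fenced blocks first,
// then inline code, links, headings, emphasis, and bullets.
function sanitizeForTts(text: string): string {
  return text
    .replace(/```[\s\S]*?```/g, "")             // drop fenced code blocks entirely
    .replace(/`([^`]*)`/g, "$1")                // unwrap inline code
    .replace(/!?\[([^\]]*)\]\([^)]*\)/g, "$1")  // keep link/image text, drop URL
    .replace(/^#{1,6}\s*/gm, "")                // strip heading markers
    .replace(/(\*\*|__|\*|_)(.+?)\1/g, "$2")    // unwrap bold/italic
    .replace(/^\s*[-*+]\s+/gm, "")              // strip bullet markers
    .replace(/[ \t]+/g, " ")                    // collapse runs of spaces/tabs
    .trim();
}
```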
`docs/channels/voice-assistant.md` (30 additions, 30 deletions)
@@ -98,40 +98,33 @@ When `broadcastAllChannels: true`, messages from ANY channel are spoken via TTS:

## Controlling response length

-Voice responses are spoken aloud, so conciseness matters. There are two layers to control this:
+Voice responses are spoken aloud, so conciseness matters. Two techniques work together:

-### 1. SOUL.md (primary — soft control)
+### 1. Use a faster model (Haiku)
+
+Assign the voice agent a faster, lighter model like Claude Haiku 4.5. It produces equivalent quality for short conversational replies while being ~2-4x faster in time-to-first-token (~0.5s vs ~1.2-2.0s for Sonnet) and ~3x cheaper.
+
+### 2. SOUL.md (conciseness instructions)

Bind a dedicated voice agent with its own workspace and SOUL.md that instructs the LLM to keep responses brief. See [RASPBERRY-PI-SETUP.md](/RASPBERRY-PI-SETUP.md) for a full template. Key rules to include:

-- Hard sentence limits (1–2 for simple questions, 5–6 max for complex topics)
+- Hard sentence limits (1-2 for simple questions, 5-6 max for complex topics)
- No markdown formatting
- Natural speech patterns

-### 2. maxTokens (secondary — hard ceiling)
-
-Create a dedicated model key for the voice agent with a low `maxTokens` value. This prevents runaway generation even if the LLM ignores the system prompt:
+### Example config

```json5
{
  agents: {
-    defaults: {
-      models: {
-        // Voice-only model key — only used by voice-agent below.
-        // Telegram, WhatsApp, Discord etc. are NOT affected.
    // Default agent — all non-voice channels use Sonnet (global default).
    { id: "main", default: true },
-    // Voice-only agent — scoped to voice-assistant channel via binding.
+    // Voice-only agent — uses Haiku for faster responses.
    {
      id: "voice-agent",
      workspace: "~/.openclaw/workspaces/voice-agent",
-      model: "anthropic/claude-sonnet-4-5-voice",
+      model: "anthropic/claude-haiku-4-5",
    },
  ],
},
@@ -142,7 +135,7 @@ Create a dedicated model key for the voice agent with a low `maxTokens` value. T
}
```

-The 512-token cap ONLY applies to the voice channel. The model key is just a catalog entry — it does nothing unless an agent explicitly references it. Only `voice-agent` does, and only the `voice-assistant` channel is bound to it. All other channels fall through to the default `main` agent with its standard 8192 maxTokens. Start with 512 tokens and adjust up if responses feel cut off.
+The Haiku model ONLY applies to the voice channel. Only `voice-agent` sets it, and only the `voice-assistant` channel is bound to `voice-agent`. All other channels fall through to the default `main` agent using Sonnet.

## Configuration
@@ -152,8 +145,8 @@ The 512-token cap ONLY applies to the voice channel. The model key is just a cat
{
  channels: {
    "voice-assistant": {
-      wsPort: 8082, // WebSocket port (default: 8082)
-      broadcastAllChannels: true, // Speak messages from all channels
+      wsPort: 8082, // WebSocket port (default: 8082)
+      broadcastAllChannels: true, // Speak messages from all channels
      accounts: {
        default: {
          name: "Pi Voice",
@@ -188,6 +181,7 @@ Connect to: `ws://openclaw-host:8082`

### Messages: Voice → OpenClaw

**Connect (identify device):**
+
```json
{
  "type": "connect",
@@ -202,6 +196,7 @@ Connect to: `ws://openclaw-host:8082`
```

**Transcription (voice input):**
+
```json
{
  "type": "transcription",
@@ -212,6 +207,7 @@ Connect to: `ws://openclaw-host:8082`
```

**Ping (keepalive):**
+
```json
{
  "type": "ping",
  "timestamp": 1234567890
}
```
@@ -222,6 +218,7 @@ Connect to: `ws://openclaw-host:8082`

### Messages: OpenClaw → Voice

**Connected (handshake response):**
+
```json
{
  "type": "connected",
@@ -231,6 +228,7 @@ Connect to: `ws://openclaw-host:8082`
```

**Speak (TTS playback):**
+
```json
{
  "type": "speak",
@@ -242,6 +240,7 @@ Connect to: `ws://openclaw-host:8082`
```

**Pong (keepalive response):**
+
```json
{
  "type": "pong",
  "timestamp": 1234567890
}
```
@@ -250,6 +249,7 @@ Connect to: `ws://openclaw-host:8082`
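The ping/pong keepalive shapes are simple enough to build and validate with small helpers. The sketch below is a hypothetical client-side fragment based only on the documented message fields (`type`, `timestamp`); it is not the project's actual client code.

```typescript
// Build a ping message matching the documented shape:
// { "type": "ping", "timestamp": 1234567890 }
type PingMsg = { type: "ping"; timestamp: number };

function makePing(now: number = Date.now()): string {
  const msg: PingMsg = { type: "ping", timestamp: now };
  return JSON.stringify(msg);
}

// Check whether an incoming frame is a well-formed pong response.
function isPong(raw: string): boolean {
  try {
    const msg = JSON.parse(raw);
    return msg?.type === "pong" && typeof msg.timestamp === "number";
  } catch {
    return false; // not JSON at all
  }
}
```

A client would send `makePing()` on an interval over the WebSocket and treat a missing pong within some timeout as a dropped connection.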