fern/customization/speech-configuration.mdx
Lines changed: 5 additions & 3 deletions
@@ -36,8 +36,8 @@ This plan defines the parameters for when the assistant begins speaking after th

 In general, turn-taking includes the following tasks:

-- **End-of-turn prediction** - predicting when the current speaker is likely to finish their turn
-- **Backchannel prediction** - detecting moments where a listener may provide short verbal acknowledgments like "uh-huh", "yeah", etc. to show engagement, without intending to take over the speaking turn.
+- **End-of-turn prediction** - predicting when the current speaker is likely to finish their turn.
+- **Backchannel prediction** - detecting moments where a listener may provide short verbal acknowledgments like "uh-huh", "yeah", etc. to show engagement, without intending to take over the speaking turn. This is better handled by the assistant's `stopSpeakingPlan`.

 We offer different providers that can be audio-based, text-based, or audio-text based:

@@ -51,11 +51,13 @@ This plan defines the parameters for when the assistant begins speaking after th

 **Audio-text based providers:**

+- **Deepgram Flux**: Deepgram's latest transcriber model with built-in conversational speech recognition. Flux combines high-quality speech-to-text with native turn detection while delivering ultra-low latency and Nova-3 level accuracy.
+
 - **Assembly**: Transcriber that also reports end-of-turn detection. To use Assembly, choose it as your transcriber without setting a separate smart endpointing plan. As transcripts arrive, we consider the `end_of_turn` flag that Assembly sends to mark the end-of-turn, stream to the LLM, and generate a response.

 **Text-based providers:**

-- **Off**: Disabled by default
+- **Off**: Smart endpointing is disabled by default. When it is set to "Off", the system automatically uses the transcriber's end-of-turn (EOT) detection if available. If the transcriber has no EOT detection, the system falls back to LiveKit when the language is set to English, and to Vapi's standard endpointing mode otherwise.
 - **LiveKit**: Recommended for English conversations as it provides the most sophisticated solution for detecting natural speech patterns and pauses. LiveKit can be fine-tuned using the `waitFunction` parameter to adjust response timing based on the probability that the user is still speaking.
 - **Vapi**: Recommended for non-English conversations or as an alternative when LiveKit isn't suitable
@@ -121,6 +122,10 @@ The start speaking plan determines when your assistant begins responding after a

 Analyzes transcription text to determine user completion based on patterns like punctuation and numbers.

+<Note>
+This plan is only used if `smartEndpointingPlan` is not set and the transcriber does not have built-in endpointing capabilities. If both plans are provided, `smartEndpointingPlan` takes precedence. Any matching `customEndpointingRules` also override this plan.
+</Note>
+
 <Tabs>
 <Tab title="Configuration">
 ```json
@@ -159,6 +164,8 @@ Analyzes transcription text to determine user completion based on patterns like

 Uses AI models to analyze speech patterns, context, and audio cues to predict when users have finished speaking. Only available for English conversations.

+**Important:** If your transcriber has built-in end-of-turn detection (like Deepgram Flux or Assembly) and you don't configure a smart endpointing plan, the system will automatically use the transcriber's EOT detection instead of smart endpointing.
+
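The fallback behavior described in this section can be sketched as a small decision function. This is a minimal illustration in Python, not Vapi's implementation: the function name `resolve_endpointing`, the set `TRANSCRIBERS_WITH_EOT`, and the returned labels are hypothetical, and treating any language code that starts with `en` as English is an assumption.

```python
from typing import Optional

# Transcribers the docs describe as having built-in end-of-turn (EOT) detection.
# These identifiers are illustrative labels, not confirmed API values.
TRANSCRIBERS_WITH_EOT = {"deepgram-flux", "assembly"}


def resolve_endpointing(smart_provider: Optional[str], transcriber: str, language: str) -> str:
    """Return a label for the mechanism that decides end of turn.

    Order sketched from the docs:
    1. A configured smart endpointing provider is used as-is.
    2. Otherwise, the transcriber's own EOT detection is used if it has one.
    3. Otherwise, fall back to LiveKit for English and to Vapi for other languages.
    """
    if smart_provider is not None and smart_provider != "off":
        return f"smart:{smart_provider}"
    if transcriber in TRANSCRIBERS_WITH_EOT:
        return f"transcriber:{transcriber}"
    # Assumption: a language code beginning with "en" counts as English.
    if language.lower().startswith("en"):
        return "fallback:livekit"
    return "fallback:vapi"
```

For example, `resolve_endpointing(None, "assembly", "en")` selects the transcriber's EOT detection, while the same call with a transcriber that has no EOT support and `language="es"` falls back to the Vapi mode.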
 <Tabs>
 <Tab title="Configuration">
 ```json
@@ -182,17 +189,48 @@ Uses AI models to analyze speech patterns, context, and audio cues to predict wh
 - **krisp**: Audio-based model analyzing prosodic features (intonation, pitch, rhythm)

 **Audio-text based providers:**
+- **deepgram-flux**: Deepgram's latest transcriber model with built-in conversational speech recognition. (English only)
 - **assembly**: Transcriber with built-in end-of-turn detection (English only)

 </Tab>
 </Tabs>

 **When to use:**

-- **LiveKit**: English conversations requiring sophisticated speech pattern analysis
+- **Deepgram Flux**: English conversations using Deepgram as a transcriber.
+- **Assembly**: Best used when Assembly is already your transcriber provider for English conversations with integrated end-of-turn detection
+- **LiveKit**: English conversations where Deepgram is not the transcriber of choice.
 - **Vapi**: Non-English conversations with default stop speaking plan settings
 - **Krisp**: Non-English conversations with a robustly configured stop speaking plan
-- **Assembly**: Best used when Assembly is already your transcriber provider for English conversations with integrated end-of-turn detection
+
+### Deepgram Flux configuration
+
+Deepgram Flux's end-of-turn detection is configured at the transcriber level, so you can tune how aggressively or conservatively the assistant decides that the user has finished speaking.
+
+**Configuration parameters:**
+
+- **eotThreshold** (Default: 0.7): Confidence level required to trigger end-of-turn detection
+  - **0.3-0.5:** Aggressive detection - responds quickly but may interrupt users mid-sentence
+  - **0.6-0.8:** Balanced detection (default: 0.7) - good balance between responsiveness and accuracy
+  - **0.9-1.0:** Conservative detection - waits longer to ensure users have finished speaking
+
+- **eotTimeoutMs** (Default: 5000): Maximum wait time in milliseconds before forcing turn end
+  - **2000-3000:** Fast timeout for quick interactions
+  - **4000-6000:** Standard timeout (default: 5000) - natural conversation flow
+  - **7000-10000:** Extended timeout for complex or thoughtful responses
+
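One way to read these two parameters together is as a pair of conditions, either of which ends the turn. The sketch below is a hedged illustration in Python, not Deepgram's or Vapi's actual logic: the function name `should_end_turn`, its arguments, and the assumption that the timeout is measured from the moment the user stopped speaking are all hypothetical.

```python
def should_end_turn(
    eot_confidence: float,
    silence_ms: float,
    eot_threshold: float = 0.7,
    eot_timeout_ms: float = 5000,
) -> bool:
    """Decide whether the user's turn is over.

    eot_confidence: the model's confidence (0.0 to 1.0) that the turn has ended.
    silence_ms: assumed here to be the time elapsed since the user last spoke.
    """
    if not 0.0 <= eot_threshold <= 1.0:
        raise ValueError("eot_threshold must be between 0.0 and 1.0")
    # Confidence reached the threshold: end the turn right away.
    if eot_confidence >= eot_threshold:
        return True
    # Otherwise the timeout acts as an upper bound on how long to wait.
    return silence_ms >= eot_timeout_ms
```

Under this reading, raising `eot_threshold` to 0.9 makes a confidence of 0.85 insufficient on its own, so the turn only ends once confidence rises further or `eot_timeout_ms` elapses.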
+**Configuration example:**
+
+```json
+{
+  "transcriber": {
+    "provider": "deepgram",
+    "model": "flux-general-en",
+    "language": "en",
+    "eotThreshold": 0.7,
+    "eotTimeoutMs": 5000
+  }
+}
+```

 ### LiveKit's Wait function

@@ -228,6 +266,44 @@ Mathematical expression that determines wait time based on speech completion pro
 - **Timing:** ~2700ms wait at 50% confidence, ~700ms at 90% confidence

+### Vapi heuristic endpointing
+
+Vapi's text-based endpointing uses heuristic rules to analyze transcription patterns and determine when users have finished speaking. The system applies these rules in priority order using the `transcriptionEndpointingPlan` settings:
+
+**Heuristic priority order:**
+
+1. **Number detection**: If the latest message ends with a number, waits for `onNumberSeconds` (default: 0.5)
+2. **Punctuation detection**: If the message contains punctuation, waits for `onPunctuationSeconds` (default: 0.1)
+3. **No punctuation fallback**: If no punctuation is detected, waits for `onNoPunctuationSeconds` (default: 1.5)
+4. **Default**: If no rules match, waits 0ms (immediate response)
+
+**How it works:**
+
+The system continuously analyzes the latest user message and applies the first matching rule. Each rule sets a specific timeout delay before triggering the end-of-turn event.
+
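The first-match rule order can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Vapi's implementation: the class `TranscriptionEndpointingPlan`, the function `wait_seconds`, and both regular expressions are hypothetical. In particular, "ends with a number" is approximated as "ends with a digit", spelled-out numbers are not handled, and the 0-second default is only reached for an empty message.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TranscriptionEndpointingPlan:
    # Defaults mirror the values quoted in the docs (in seconds).
    on_punctuation_seconds: float = 0.1
    on_no_punctuation_seconds: float = 1.5
    on_number_seconds: float = 0.5


# Assumption: a trailing digit stands in for "ends with a number".
_ENDS_WITH_DIGIT = re.compile(r"\d$")
# Assumption: this character set stands in for "contains punctuation".
_HAS_PUNCTUATION = re.compile(r"[.,!?;:]")


def wait_seconds(message: str, plan: Optional[TranscriptionEndpointingPlan] = None) -> float:
    """Return the delay before end of turn, using the first matching rule."""
    plan = plan or TranscriptionEndpointingPlan()
    text = message.strip()
    if not text:
        return 0.0  # Rule 4: nothing to match, respond immediately.
    if _ENDS_WITH_DIGIT.search(text):
        return plan.on_number_seconds  # Rule 1: number detection.
    if _HAS_PUNCTUATION.search(text):
        return plan.on_punctuation_seconds  # Rule 2: punctuation detection.
    return plan.on_no_punctuation_seconds  # Rule 3: no punctuation fallback.
```

With the default plan, a message such as "My zip code is 94107" waits for the number delay, a sentence ending in a period waits for the punctuation delay, and an unpunctuated fragment waits the longest.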
+**Configuration example:**
+
+```json
+{
+  "startSpeakingPlan": {
+    "smartEndpointingPlan": {
+      "provider": "vapi"
+    },
+    "transcriptionEndpointingPlan": {
+      "onPunctuationSeconds": 0.1,
+      "onNoPunctuationSeconds": 1.5,
+      "onNumberSeconds": 0.5
+    }
+  }
+}
+```
+
+**When to use:**
+
+- Non-English conversations where LiveKit isn't available