Commit b8c5254

chore: add flux and vapi docs for endpointing (#749)
1 parent 9563fd1 commit b8c5254

2 files changed

+114
-8
lines changed

fern/customization/speech-configuration.mdx

Lines changed: 5 additions & 3 deletions
@@ -36,8 +36,8 @@ This plan defines the parameters for when the assistant begins speaking after th
3636

3737
In general, turn-taking includes the following tasks:
3838

39-
- **End-of-turn prediction** - predicting when the current speaker is likely to finish their turn
40-
- **Backchannel prediction** - detecting moments where a listener may provide short verbal acknowledgments like "uh-huh", "yeah", etc. to show engagement, without intending to take over the speaking turn.
39+
- **End-of-turn prediction** - predicting when the current speaker is likely to finish their turn.
40+
- **Backchannel prediction** - detecting moments where a listener may provide short verbal acknowledgments like "uh-huh", "yeah", etc. to show engagement, without intending to take over the speaking turn. This is better handled by the assistant's stopSpeakingPlan.
4141

4242
We offer different providers that can be audio-based, text-based, or audio-text based:
4343

@@ -51,11 +51,13 @@ This plan defines the parameters for when the assistant begins speaking after th
5151

5252
**Audio-text based providers:**
5353

54+
- **Deepgram Flux**: Deepgram's latest transcriber model with built-in conversational speech recognition. Flux combines high-quality speech-to-text with native turn detection, while delivering ultra-low latency and Nova-3 level accuracy.
55+
5456
- **Assembly**: Transcriber that also reports end-of-turn detection. To use Assembly, choose it as your transcriber without setting a separate smart endpointing plan. As transcripts arrive, we consider the `end_of_turn` flag that Assembly sends to mark the end-of-turn, stream to the LLM, and generate a response.
5557

5658
**Text-based providers:**
5759

58-
- **Off**: Disabled by default
60+
- **Off**: Disabled by default. When smart endpointing is set to "Off", the system automatically uses the transcriber's end-of-turn (EOT) detection if available. If no transcriber EOT detection is available, the system defaults to LiveKit when the language is set to English, and to Vapi's standard endpointing mode otherwise.
5961
- **LiveKit**: Recommended for English conversations as it provides the most sophisticated solution for detecting natural speech patterns and pauses. LiveKit can be fine-tuned using the `waitFunction` parameter to adjust response timing based on the probability that the user is still speaking.
6062
- **Vapi**: Recommended for non-English conversations or as an alternative when LiveKit isn't suitable
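
The fallback behavior described under **Off** can be sketched as a small selection function. This is an illustrative sketch only, not Vapi's implementation: the function name, the `"transcriber"` return label, and the `startswith("en")` language check are assumptions made for the example.

```python
from typing import Optional


def resolve_end_of_turn_source(
    smart_endpointing_provider: Optional[str],
    transcriber_has_eot: bool,
    language: str,
) -> str:
    """Pick which component decides end-of-turn (hypothetical helper).

    Mirrors the documented fallback: an explicit smart endpointing provider
    wins; otherwise the transcriber's own EOT detection is used if present;
    otherwise LiveKit for English and Vapi for everything else.
    """
    # An explicitly configured provider (anything other than "off") is used as-is.
    if smart_endpointing_provider and smart_endpointing_provider.lower() != "off":
        return smart_endpointing_provider.lower()

    # Smart endpointing is off: prefer the transcriber's built-in EOT detection.
    if transcriber_has_eot:
        return "transcriber"

    # Assumption: English is identified by a language code starting with "en".
    if language.lower().startswith("en"):
        return "livekit"
    return "vapi"
```

For example, `resolve_end_of_turn_source(None, True, "en")` selects the transcriber's detection, while the same call with `transcriber_has_eot=False` and `language="es"` falls through to `"vapi"`.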
6163

fern/customization/voice-pipeline-configuration.mdx

Lines changed: 109 additions & 5 deletions
@@ -86,9 +86,10 @@ User Audio → VAD → Transcription → Start Speaking Decision → LLM → TTS
8686
Voice Activity Detection (VAD) detects utterance-stop
8787
</Step>
8888
<Step title="Endpointing decision">
89-
System evaluates completion using: - Custom Rules (highest priority) - Smart
90-
Endpointing Plan (LiveKit for English) - Transcription Endpointing Plan
91-
(fallback)
89+
System evaluates completion using this priority order:
90+
1. **Transcriber EOT detection** (if transcriber has built-in EOT and no smart endpointing plan)
91+
2. **Custom Rules** (highest priority when configured)
92+
3. **Smart Endpointing Plan** (LiveKit for English, Vapi for non-English)
9293
</Step>
9394
<Step title="Response generation">
9495
LLM request sent immediately → TTS processes → waitSeconds applied →
@@ -121,6 +122,10 @@ The start speaking plan determines when your assistant begins responding after a
121122

122123
Analyzes transcription text to determine user completion based on patterns like punctuation and numbers.
123124

125+
<Note>
126+
This plan is only used if `smartEndpointingPlan` is not set and the transcriber does not have built-in endpointing capabilities. If both this plan and `smartEndpointingPlan` are provided, `smartEndpointingPlan` takes precedence. This plan is also overridden by any matching `customEndpointingRules`.
127+
</Note>
128+
124129
<Tabs>
125130
<Tab title="Configuration">
126131
```json
@@ -159,6 +164,8 @@ Analyzes transcription text to determine user completion based on patterns like
159164

160165
Uses AI models to analyze speech patterns, context, and audio cues to predict when users have finished speaking. Only available for English conversations.
161166

167+
**Important:** If your transcriber has built-in end-of-turn detection (like Deepgram Flux or Assembly) and you don't configure a smart endpointing plan, the system will automatically use the transcriber's EOT detection instead of smart endpointing.
168+
162169
<Tabs>
163170
<Tab title="Configuration">
164171
```json
@@ -182,17 +189,48 @@ Uses AI models to analyze speech patterns, context, and audio cues to predict wh
182189
- **krisp**: Audio-based model analyzing prosodic features (intonation, pitch, rhythm)
183190

184191
**Audio-text based providers:**
192+
- **deepgram-flux**: Deepgram's latest transcriber model with built-in conversational speech recognition. (English only)
185193
- **assembly**: Transcriber with built-in end-of-turn detection (English only)
186194

187195
</Tab>
188196
</Tabs>
189197

190198
**When to use:**
191199

192-
- **LiveKit**: English conversations requiring sophisticated speech pattern analysis
200+
- **Deepgram Flux**: English conversations using Deepgram as a transcriber.
201+
- **Assembly**: Best used when Assembly is already your transcriber provider for English conversations with integrated end-of-turn detection
202+
- **LiveKit**: English conversations where Deepgram is not the transcriber of choice.
193203
- **Vapi**: Non-English conversations with default stop speaking plan settings
194204
- **Krisp**: Non-English conversations with a robustly configured stop speaking plan
195-
- **Assembly**: Best used when Assembly is already your transcriber provider for English conversations with integrated end-of-turn detection
205+
206+
### Deepgram Flux configuration
207+
208+
Deepgram Flux's end-of-turn detection is configured at the transcriber level, allowing you to fine-tune how aggressively or conservatively the assistant detects when users have finished speaking.
209+
210+
**Configuration parameters:**
211+
212+
- **eotThreshold** (Default: 0.7): Confidence level required to trigger end-of-turn detection
213+
- **0.3-0.5:** Aggressive detection - responds quickly but may interrupt users mid-sentence
214+
- **0.6-0.8:** Balanced detection (default: 0.7) - good balance between responsiveness and accuracy
215+
- **0.9-1.0:** Conservative detection - waits longer to ensure users have finished speaking
216+
217+
- **eotTimeoutMs** (Default: 5000): Maximum wait time in milliseconds before forcing turn end
218+
- **2000-3000:** Fast timeout for quick interactions
219+
- **4000-6000:** Standard timeout (default: 5000) - natural conversation flow
220+
- **7000-10000:** Extended timeout for complex or thoughtful responses
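
As a rough mental model of how the two parameters interact, the sketch below ends the turn either when the confidence reaches `eotThreshold` or when `eotTimeoutMs` has elapsed. It is an assumption-laden illustration, not Deepgram's or Vapi's actual logic: the function name, the `eot_confidence` input, and measuring `elapsed_ms` from the moment the user stopped speaking are all hypothetical.

```python
def should_end_turn(
    eot_confidence: float,
    elapsed_ms: int,
    eot_threshold: float = 0.7,
    eot_timeout_ms: int = 5000,
) -> bool:
    """Return True when the user's turn should be treated as finished.

    eot_confidence: model's end-of-turn confidence in [0, 1] (assumed input).
    elapsed_ms: time since the user stopped speaking (assumed reference point).
    """
    if not 0.0 <= eot_threshold <= 1.0:
        raise ValueError("eot_threshold must be between 0 and 1")

    # Confidence path: a lower threshold fires sooner (more aggressive).
    if eot_confidence >= eot_threshold:
        return True

    # Timeout path: force the turn to end once the maximum wait has passed.
    return elapsed_ms >= eot_timeout_ms
```

With the defaults, a confidence of 0.75 ends the turn immediately, whereas raising the threshold to 0.9 keeps the assistant waiting until either the confidence rises or the timeout expires.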
221+
222+
**Configuration example:**
223+
224+
```json
225+
{
226+
"transcriber": {
227+
"provider": "flux-general-en",
228+
"language": "en",
229+
"eotThreshold": 0.7,
230+
"eotTimeoutMs": 5000
231+
}
232+
}
233+
```
196234

197235
### LiveKit's Wait function
198236

@@ -228,6 +266,44 @@ Mathematical expression that determines wait time based on speech completion pro
228266
- **Use case:** Healthcare, formal settings, sensitive conversations
229267
- **Timing:** ~2700ms wait at 50% confidence, ~700ms at 90% confidence
230268

269+
### Vapi heuristic endpointing
270+
271+
Vapi's text-based endpointing uses heuristic rules to analyze transcription patterns and determine when users have finished speaking. The system applies these rules in priority order using the `transcriptionEndpointingPlan` settings:
272+
273+
**Heuristic priority order:**
274+
275+
1. **Number detection**: If the latest message ends with a number, waits for `onNumberSeconds` (default: 0.5)
276+
2. **Punctuation detection**: If the message contains punctuation, waits for `onPunctuationSeconds` (default: 0.1)
277+
3. **No punctuation fallback**: If no punctuation is detected, waits for `onNoPunctuationSeconds` (default: 1.5)
278+
4. **Default**: If no rules match, waits 0ms (immediate response)
279+
280+
**How it works:**
281+
282+
The system continuously analyzes the latest user message and applies the first matching rule. Each rule sets a specific timeout delay before triggering the end-of-turn event.
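
The first-match behavior can be sketched as follows. This is a minimal illustration under stated assumptions, not Vapi's implementation: the regular expressions used to detect a trailing number and punctuation, the dataclass, and the function name are hypothetical; only the three setting names and their defaults come from the documentation above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptionEndpointingPlan:
    # Defaults taken from the documented transcriptionEndpointingPlan settings.
    on_punctuation_seconds: float = 0.1
    on_no_punctuation_seconds: float = 1.5
    on_number_seconds: float = 0.5


# Assumption: "ends with a number" means the last non-space character is a digit.
_ENDS_WITH_NUMBER = re.compile(r"\d\s*$")
# Assumption: this character set is what counts as punctuation.
_PUNCTUATION = re.compile(r"[.!?,;:]")


def heuristic_wait_seconds(
    message: str, plan: Optional[TranscriptionEndpointingPlan] = None
) -> float:
    """Return the wait (in seconds) chosen by the first matching rule."""
    plan = plan or TranscriptionEndpointingPlan()
    text = message.strip()

    if not text:
        return 0.0  # Default: nothing matched, respond immediately.
    if _ENDS_WITH_NUMBER.search(text):
        return plan.on_number_seconds  # Rule 1: number detection.
    if _PUNCTUATION.search(text):
        return plan.on_punctuation_seconds  # Rule 2: punctuation detection.
    return plan.on_no_punctuation_seconds  # Rule 3: no punctuation fallback.
```

Because number detection is checked first, a message such as "my number is 555 0100" waits for `onNumberSeconds` even though it contains no punctuation, giving the caller time to continue dictating digits.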
283+
284+
**Configuration example:**
285+
286+
```json
287+
{
288+
"startSpeakingPlan": {
289+
"smartEndpointingPlan": {
290+
"provider": "vapi"
291+
},
292+
"transcriptionEndpointingPlan": {
293+
"onPunctuationSeconds": 0.1,
294+
"onNoPunctuationSeconds": 1.5,
295+
"onNumberSeconds": 0.5
296+
}
297+
}
298+
}
299+
```
300+
301+
**When to use:**
302+
303+
- Non-English conversations where LiveKit isn't available
304+
- Scenarios requiring predictable, rule-based endpointing behavior
305+
- Fallback option when other smart endpointing providers aren't suitable
306+
231307
### Krisp threshold configuration
232308

233309
Krisp's audio-based model returns a probability between 0 and 1, where 1 means the user definitely stopped speaking and 0 means they're still speaking.
@@ -588,6 +664,34 @@ User Interrupts → Assistant Audio Stopped → backoffSeconds Blocks All Output
588664

589665
**Optimized for:** English conversations with integrated transcriber and sophisticated end-of-turn detection.
590666

667+
### Audio-text based endpointing (Deepgram Flux example)
668+
669+
```json
670+
{
671+
"transcriber": {
672+
"provider": "flux-general-en",
673+
"language": "en",
674+
"eotThreshold": 0.7,
675+
"eotTimeoutMs": 5000
676+
},
677+
"stopSpeakingPlan": {
678+
"numWords": 2,
679+
"voiceSeconds": 0.2,
680+
"backoffSeconds": 1.0,
681+
"acknowledgementPhrases": [
682+
"okay",
683+
"right",
684+
"uh-huh",
685+
"yeah",
686+
"mm-hmm",
687+
"got it"
688+
]
689+
}
690+
}
691+
```
692+
693+
**Optimized for:** English conversations where Deepgram is set as transcriber.
694+
591695
### Education and training
592696

593697
```json
