
Commit 3c2ac18

VAP-8954 Add Krisp and Assembly to smartEndpoitingPlan docs (#680)
* chore: add krisp and assembly docs
* chore: add Assembly disclaimer

Co-authored-by: Dan Goosewin <[email protected]>

2 files changed: +183 -28 lines changed

2 files changed

+183
-28
lines changed

fern/customization/speech-configuration.mdx

Lines changed: 40 additions & 19 deletions
@@ -6,31 +6,55 @@ slug: customization/speech-configuration
## Overview

Speech configuration lets you control exactly when your assistant starts and stops speaking during a conversation. By tuning these settings, you can make your assistant feel more natural, avoid interrupting the customer, and reduce awkward pauses.

<Note>
  Speech speed can be controlled, but only PlayHT currently supports this feature with the `speed` field. Other providers do not currently support speed.
</Note>

The two main components are:

- **Speaking Plan**: Controls when the assistant begins speaking after the customer finishes or pauses.
- **Stop Speaking Plan**: Controls when the assistant stops speaking if the customer starts talking.

Fine-tuning these plans helps you adapt the assistant's responsiveness to your use case—whether you want fast, snappy replies or a more patient, human-like conversation flow.

<Note>Currently, these configurations can only be set via API.</Note>
The rest of this page explains each setting and provides practical examples for different scenarios.

## Start Speaking Plan

This plan defines the parameters for when the assistant begins speaking after the customer pauses or finishes.

- **Wait Time Before Speaking**: You can set how long the assistant waits before speaking after the customer finishes. The default is 0.4 seconds, but you can increase it if the assistant is speaking too soon, or decrease it if there's too much delay.
  **Example:** For tech support calls, set `waitSeconds` to more than 1.0 seconds to give customers time to complete their thoughts, even if they pause in between.
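
  As a minimal sketch (with 1.2 as an illustrative value, using the same top-level shape as the `stopSpeakingPlan` snippet later on this page), that might look like:

  ```json
  "startSpeakingPlan": {
    "waitSeconds": 1.2
  }
  ```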

- **Smart Endpointing Plan**: This feature uses advanced processing to detect when the customer has truly finished speaking, especially if they pause mid-thought.

  In general, turn-taking includes the following tasks:

  - **End-of-turn prediction** - predicting when the current speaker is likely to finish their turn
  - **Backchannel prediction** - detecting moments where a listener may offer short verbal acknowledgments like "uh-huh" or "yeah" to show engagement, without intending to take over the speaking turn

  We offer different providers that can be audio-based, text-based, or audio-text based:

  **Audio-based providers:**

  - **Krisp**: Audio-based model that analyzes prosodic and acoustic features such as changes in intonation, pitch, and rhythm to detect when users finish speaking. Since it's audio-based, it always notifies when the user is done speaking, even for brief acknowledgments. Vapi offers configurable acknowledgement words and stop speaking plan settings to handle this properly.

    Configure Krisp with a threshold between 0 and 1 (default 0.5), where 1 means the user has definitely stopped speaking and 0 means they're still speaking. Use lower values for snappier conversations and higher values for more conservative detection.

    When interacting with an AI agent, users may genuinely want to interrupt to ask a question or shift the conversation, or they might simply be using backchannel cues like "right" or "okay" to signal that they're actively listening. The core challenge lies in distinguishing meaningful interruptions from casual acknowledgments. Since the audio-based model signals end-of-turn after each word, configure the stop speaking plan with the right number of words required to interrupt (`numWords`), interruption settings, and acknowledgement phrases to handle backchanneling properly.
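
    As a minimal sketch (reusing the `smartEndpointingPlan` shape from the Krisp threshold configuration example in `voice-pipeline-configuration.mdx` below):

    ```json
    "startSpeakingPlan": {
      "smartEndpointingPlan": {
        "provider": "krisp",
        "threshold": 0.5
      }
    }
    ```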

  **Audio-text based providers:**

  - **Assembly**: Transcriber that also reports end-of-turn detection. To use Assembly, choose it as your transcriber without setting a separate smart endpointing plan. As transcripts arrive, we use the `end_of_turn` flag that Assembly sends to mark the end of the turn, stream the transcript to the LLM, and generate a response.
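
    As a minimal sketch (reusing the transcriber fields from the Assembly turn detection examples in `voice-pipeline-configuration.mdx` below):

    ```json
    "transcriber": {
      "provider": "assembly",
      "endOfTurnConfidenceThreshold": 0.4,
      "minEndOfTurnSilenceWhenConfident": 400,
      "maxTurnSilence": 1280
    }
    ```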

  **Text-based providers:**

  - **Off**: Disabled by default
  - **LiveKit**: Recommended for English conversations as it provides the most sophisticated solution for detecting natural speech patterns and pauses. LiveKit can be fine-tuned using the `waitFunction` parameter to adjust response timing based on the probability that the user is still speaking.
  - **Vapi**: Recommended for non-English conversations or as an alternative when LiveKit isn't suitable
@@ -39,51 +63,48 @@ This plan defines the parameters for when the assistant begins speaking after th

**LiveKit Smart Endpointing Configuration:**
When using LiveKit, you can customize the `waitFunction` parameter, which determines how long the bot will wait to start speaking based on the likelihood that the user has finished speaking:

```
waitFunction: "200 + 8000 * x"
```

This function maps probabilities (0-1) to milliseconds of wait time. A probability of 0 means high confidence the caller has stopped speaking, while 1 means high confidence they're still speaking. The default function (`200 + 8000 * x`) creates a wait time between 200ms (when x=0) and 8200ms (when x=1). You can customize this with your own mathematical expression, such as `4000 * (1 - cos(pi * x))` for a different response curve.
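
As a minimal sketch (assuming `waitFunction` is set alongside `provider` inside the `smartEndpointingPlan`, mirroring the provider examples elsewhere on this page):

```json
"startSpeakingPlan": {
  "smartEndpointingPlan": {
    "provider": "livekit",
    "waitFunction": "200 + 8000 * x"
  }
}
```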

**Example:** In insurance claims, smart endpointing helps avoid interruptions while customers think through complex responses. For instance, when the assistant asks "do you want a loan," the system can intelligently wait for the complete response rather than interrupting after the initial "yes" or "no." For responses requiring number sequences like "What's your account number?", the system can detect natural pauses between digits without prematurely ending the customer's turn to speak.

- **Transcription-Based Detection**: Customize how the assistant determines that the customer has stopped speaking based on what they're saying. This offers more control over the timing. **Example:** When a customer says, "My account number is 123456789, I want to transfer $500."
  - The system detects the number "123456789" and waits for 0.5 seconds (`onNumberSeconds`) to ensure the customer isn't still speaking.
  - If the customer finishes with an additional line, "I want to transfer $500.", the system uses `onPunctuationSeconds` to confirm the end of the speech and then proceeds with processing the request.
  - In a scenario where the customer has been silent for a long time and has already finished speaking, but the transcriber isn't confident enough to punctuate the transcription, `onNoPunctuationSeconds` (1.5 seconds) is used. A minimal sketch of these fields follows this list.
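
  As a minimal sketch (assuming these fields sit in a `transcriptionEndpointingPlan` under the `startSpeakingPlan`, with the `onPunctuationSeconds` value here being an illustrative assumption):

  ```json
  "startSpeakingPlan": {
    "transcriptionEndpointingPlan": {
      "onPunctuationSeconds": 0.1,
      "onNoPunctuationSeconds": 1.5,
      "onNumberSeconds": 0.5
    }
  }
  ```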

5780
## Stop Speaking Plan
81+
5882
The Stop Speaking Plan defines when the assistant stops talking after detecting customer speech.
5983

6084
- **Words to Stop Speaking**: Define how many words the customer needs to say before the assistant stops talking. If you want immediate reaction, set this to 0. Increase it to avoid interruptions by brief acknowledgments like "okay" or "right". **Example:** While setting an appointment with a clinic, set `numWords` to 2-3 words to allow customers to finish brief clarifications without triggering interruptions.
6185

6286
- **Voice Activity Detection**: Adjust how long the customer needs to be speaking before the assistant stops. The default is 0.2 seconds, but you can tweak this to balance responsiveness and avoid false triggers.
63-
**Example:** For a banking call center, setting a higher `voiceSeconds` value ensures accuracy by reducing false positives. This avoids interruptions caused by background sounds, even if it slightly delays the detection of speech onset. This tradeoff is essential to ensure the assistant processes only correct and intended information.
64-
87+
**Example:** For a banking call center, setting a higher `voiceSeconds` value ensures accuracy by reducing false positives. This avoids interruptions caused by background sounds, even if it slightly delays the detection of speech onset. This tradeoff is essential to ensure the assistant processes only correct and intended information.
6588

6689
- **Pause Before Resuming**: Control how long the assistant waits before starting to talk again after being interrupted. The default is 1 second, but you can adjust it depending on how quickly the assistant should resume.
67-
**Example:** For quick queries (e.g., "What's the total order value in my cart?"), set `backoffSeconds` to 1 second.
90+
**Example:** For quick queries (e.g., "What's the total order value in my cart?"), set `backoffSeconds` to 1 second.
6891

6992
Here's a code snippet for Stop Speaking Plan -
7093

7194
```json
7295
"stopSpeakingPlan": {
7396
"numWords": 0,
7497
"voiceSeconds": 0.2,
75-
"backoffSeconds": 1
98+
"backoffSeconds": 1
7699
}
77100
```
78101

79-
80102
## Considerations for Configuration
81103

82104
- **Customer Style**: Think about whether the customer pauses mid-thought or provides continuous speech. Adjust wait times and enable smart endpointing as needed.
83105

84106
- **Background Noise**: If there's a lot of background noise, you may need to tweak the settings to avoid false triggers. Default for phone calls is 'office' and default for web calls is 'off'.
85107

86-
87108
```json
88109
"backgroundSound": "off",
89110
```

fern/customization/voice-pipeline-configuration.mdx

Lines changed: 143 additions & 9 deletions
@@ -174,22 +174,27 @@ Uses AI models to analyze speech patterns, context, and audio cues to predict wh
```
</Tab>
<Tab title="Providers">
  **Text-based providers:**

  - **livekit**: Advanced model trained on conversation data (English only)
  - **vapi**: VAPI-trained model (non-English conversations or LiveKit alternative)

  **Audio-based providers:**

  - **krisp**: Audio-based model analyzing prosodic features (intonation, pitch, rhythm)

  **Audio-text based providers:**

  - **assembly**: Transcriber with built-in end-of-turn detection (English only)
</Tab>
</Tabs>

**When to use:**

- **LiveKit**: English conversations requiring sophisticated speech pattern analysis
- **Vapi**: Non-English conversations with default stop speaking plan settings
- **Krisp**: Non-English conversations with a robustly configured stop speaking plan
- **Assembly**: Best used when Assembly is already your transcriber provider, for English conversations with integrated end-of-turn detection

192-
### Wait function
197+
### LiveKit's Wait function
193198

194199
Mathematical expression that determines wait time based on speech completion probability. The function takes a confidence value (0-1) and returns a wait time in milliseconds.
195200

@@ -223,6 +228,83 @@ Mathematical expression that determines wait time based on speech completion pro
- **Use case:** Healthcare, formal settings, sensitive conversations
- **Timing:** ~2700ms wait at 50% confidence, ~700ms at 90% confidence

### Krisp threshold configuration

Krisp's audio-based model returns a probability between 0 and 1, where 1 means the user has definitely stopped speaking and 0 means they're still speaking.

**Threshold settings:**

- **0.0-0.3:** Very aggressive detection - responds quickly but may interrupt users mid-sentence
- **0.4-0.6:** Balanced detection (default: 0.5) - good balance between responsiveness and accuracy
- **0.7-1.0:** Conservative detection - waits longer to ensure users have finished speaking

**Configuration example:**

```json
{
  "startSpeakingPlan": {
    "smartEndpointingPlan": {
      "provider": "krisp",
      "threshold": 0.5
    }
  }
}
```

**Important considerations:**
Since Krisp is audio-based, it always notifies when the user is done speaking, even for brief acknowledgments. Configure the stop speaking plan with appropriate `acknowledgementPhrases` and `numWords` settings to handle backchanneling properly.

### Assembly turn detection

AssemblyAI's turn detection model uses a neural network to detect when someone has finished speaking. The model understands the meaning and flow of speech to make better decisions about when a turn has ended.

When the model detects an end-of-turn, it returns `end_of_turn=True` in the response.

**Quick start configurations:**

To use Assembly's turn detection, set Assembly as your transcriber provider and configure these fields in the assistant's transcriber (**do not set any smartEndpointingPlan**):

**Aggressive (Fast Response):**

```json
{
  "endOfTurnConfidenceThreshold": 0.4,
  "minEndOfTurnSilenceWhenConfident": 160,
  "maxTurnSilence": 400
}
```

- **Use cases:** Agent Assist, IVR replacements, Retail/E-commerce, Telecom
- **Behavior:** Ends turns very quickly, optimized for short responses

**Balanced (Natural Flow):**

```json
{
  "endOfTurnConfidenceThreshold": 0.4,
  "minEndOfTurnSilenceWhenConfident": 400,
  "maxTurnSilence": 1280
}
```

- **Use cases:** Customer Support, Tech Support, Financial Services, Travel & Hospitality
- **Behavior:** Natural middle ground, allowing enough pause for conversational turns

**Conservative (Patient Response):**

```json
{
  "endOfTurnConfidenceThreshold": 0.7,
  "minEndOfTurnSilenceWhenConfident": 800,
  "maxTurnSilence": 3600
}
```

- **Use cases:** Healthcare, Mental Health Support, Sales & Consulting, Legal & Insurance
- **Behavior:** Holds the floor longer, optimized for reflective or complex speech

For detailed information about how Assembly's turn detection works, see the [AssemblyAI Turn Detection documentation](https://www.assemblyai.com/docs/speech-to-text/universal-streaming/turn-detection).

### Wait seconds

Final audio delay applied after all processing completes, before the assistant speaks.
@@ -454,6 +536,58 @@ User Interrupts → Assistant Audio Stopped → backoffSeconds Blocks All Output

**Optimized for:** Text-based endpointing with longer timeouts for different speech patterns and international support.

### Audio-based endpointing (Krisp example)

```json
{
  "startSpeakingPlan": {
    "waitSeconds": 0.4,
    "smartEndpointingPlan": {
      "provider": "krisp",
      "threshold": 0.5
    }
  },
  "stopSpeakingPlan": {
    "numWords": 2,
    "voiceSeconds": 0.2,
    "backoffSeconds": 1.0,
    "acknowledgementPhrases": [
      "okay",
      "right",
      "uh-huh",
      "yeah",
      "mm-hmm",
      "got it"
    ]
  }
}
```

**Optimized for:** Non-English conversations with robust backchanneling configuration to handle audio-based detection limitations.

### Audio-text based endpointing (Assembly example)

```json
{
  "transcriber": {
    "provider": "assembly",
    "endOfTurnConfidenceThreshold": 0.4,
    "minEndOfTurnSilenceWhenConfident": 400,
    "maxTurnSilence": 1280
  },
  "startSpeakingPlan": {
    "waitSeconds": 0.4
  },
  "stopSpeakingPlan": {
    "numWords": 0,
    "voiceSeconds": 0.2,
    "backoffSeconds": 1.0
  }
}
```

**Optimized for:** English conversations with integrated transcriber and sophisticated end-of-turn detection.

### Education and training

```json
