---
subtitle: Timing control for assistant speech
slug: customization/speech-configuration
---
## Introduction
Conversation Analysis (CA) examines the structure and organization of human interactions, focusing on how participants manage conversations in real-time. We mimic this natural behavior in our API.
Key concepts include:
<AccordionGroup>
<Accordion title="Turn-Taking Organization">

Conversations are structured into turns, where typically one person speaks at a time. Speakers use Turn Construction Units (TCUs)—such as words, phrases, or clauses—that listeners recognize, allowing them to anticipate when a turn will end and when it's appropriate to speak. Transition Relevance Places (TRPs) are points where a change of speaker can occur. Turn allocation follows specific rules:

- **Current speaker selects next**: The current speaker designates who speaks next.
- **Self-selection**: If not selected, another participant may self-select to speak.
- **Continuation**: If no one else speaks, the current speaker may continue.
Silences are categorized as pauses (within a turn), gaps (between turns), or lapses (when no one speaks).

</Accordion>

<Accordion title="Sequence Organization">

Conversations often involve sequences like adjacency pairs, where an initial utterance (e.g., a question) prompts a related response (e.g., an answer). These pairs can be expanded with pre-sequences (preparing for the main action), insert expansions (occurring between the initial and responsive actions), and post-expansions (following the main action).

</Accordion>

<Accordion title="Preference Organization">

Certain responses are socially preferred. For example, agreements or acceptances are typically delivered promptly and directly, while disagreements or refusals may be delayed or mitigated to maintain social harmony.

</Accordion>

<Accordion title="Repair Mechanisms">

Participants address problems in speaking, hearing, or understanding through repair strategies. Self-repair (the speaker corrects themselves) is generally preferred over other-repair (another person corrects the speaker), helping to maintain conversational flow and mutual understanding.

</Accordion>

<Accordion title="Action Formation">

Speakers perform actions (e.g., questioning, requesting, asserting) through their utterances. Understanding how these actions are constructed and interpreted is central to CA, as it reveals how participants achieve social objectives through conversation.

</Accordion>

<Accordion title="Adjacency Pair">

An adjacency pair is a fundamental unit of conversation consisting of two related utterances. The first part (e.g., a question) typically elicits a specific response (e.g., an answer). These pairs are essential for structuring conversations and ensuring coherence.

</Accordion>

</AccordionGroup>

These foundational structures illustrate how individuals collaboratively produce and interpret talk in interaction, ensuring coherent and meaningful communication.
## Speech Configuration in Vapi
Speech configuration is a crucial aspect of designing a voice assistant that delivers a seamless and engaging user experience. By customizing the assistant's speech settings, you can optimize its responsiveness, naturalness, and timing during interactions with users.
The speech configuration options in Vapi allow you to control the timing of when the assistant starts and stops speaking during interactions with a customer.

These plans ensure that the assistant does not interrupt the customer and also prevent awkward pauses that can occur if the assistant starts speaking too late.

Adjusting these parameters helps tailor the assistant's responsiveness to different conversational dynamics.
<CardGroup cols={2}>
  <Card
    title='Transcriber Settings'
    icon='solid code'
    href='#transcriber-settings'
  >
    Specify the provider, language, and model for speech transcription.
  </Card>
  <Card
    title='Timing Settings'
    href='#timing-settings'
  >
    Configure the Speaking Plan and Stop Speaking Plan to optimize the assistant's speech timing. Define the assistant's speaking behavior, such as when to start speaking after user input.
  </Card>
  <Card
    title='Privacy Settings'
    href='#privacy-settings'
  >
    Privacy settings also affect the assistant's speech behavior. Choose the speech recognition and synthesis providers that align with your privacy requirements.
  </Card>
</CardGroup>

## Transcriber Settings

- **Transcription Provider**: Select a transcription service that aligns with your application's needs. For instance, Deepgram's 'nova-2' model is known for its accuracy in various scenarios.
- **Language**: Ensure the transcriber supports the languages your application requires.
- **Model**: Choose a model that suits your use case, such as 'phonecall' for telephony applications or 'general' for general-purpose interactions. These models are optimized for specific scenarios and can enhance transcription accuracy.
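
For reference, here's a minimal sketch of a transcriber block in the assistant request body, using the providers and models mentioned above (confirm the exact field names against the API reference):

```json
"transcriber": {
  "provider": "deepgram", // Transcription provider
  "model": "nova-2", // Or "phonecall" for telephony use cases
  "language": "en" // A language the provider supports
}
```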
To use one of these providers, you can specify the provider and model in the `model` parameter of the Assistant.
You can find more details in the Custom LLMs section of the documentation.
Example request body:
```json
{
  "model": {
    "provider": "openai",
    "model": "gpt-3.5-turbo",
    "systemPrompt": "You're a versatile AI assistant named Vapi who is fun to talk with."
  }
}
```
## Voice Settings

- **Voice Provider**: Options like ElevenLabs or Azure offer diverse voice selections.
- **Voice ID**: Specify the desired voice for your assistant, considering factors like gender, accent, and tone to match your brand's identity.
- **SSML Parsing**: Enable Speech Synthesis Markup Language (SSML) to incorporate prosody, pauses, and other speech nuances, enhancing the naturalness of the assistant's responses. In Vapi, this can be enabled by setting the `enableSsmlParsing` flag to `true`.
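
As a minimal sketch, a voice configuration combining these options might look like this (the provider key and voice ID are illustrative placeholders):

```json
"voice": {
  "provider": "11labs", // Illustrative provider key
  "voiceId": "example-voice-id", // Placeholder; pick a voice that fits your brand
  "enableSsmlParsing": true // Parse SSML tags in generated speech
}
```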
## Privacy Settings

- **Speech Recognition**: Choose the speech recognition provider and model that best suits your application's needs. Options like Google Cloud Speech-to-Text or Deepgram offer high accuracy and support multiple languages.
- **Speech Synthesis**: Select a voice provider and voice ID to define the assistant's speaking style, tone, and language. Consider factors like gender, accent, and tone to match your brand's identity.
## Timing Settings

- **Start Speaking Plan**: Define the timing for when the assistant begins speaking during interactions. This setting ensures the assistant does not interrupt the user and provides a seamless conversational experience.
- **Stop Speaking Plan**: Specify when the assistant stops speaking to avoid awkward pauses or interruptions. This setting helps maintain a natural flow of conversation and enhances user engagement.

### Start Speaking Plan

- **Wait Time Before Speaking**: Adjust how long the assistant pauses after the user finishes speaking to prevent interruptions.

You can set how long the assistant waits before speaking after the customer finishes. The default is 0.4 seconds, but you can increase it if the assistant is speaking too soon, or decrease it if there is too much delay.

<Tip title="Example">
For tech support calls, set `waitSeconds` for the assistant to more than 1.0 seconds to give customers time to complete their thoughts, even if they have some pauses in between.
</Tip>

Another example is customer service calls, where the assistant should start speaking immediately after the customer finishes speaking. In this case, set `waitSeconds` to 0.0 seconds.
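
As a minimal sketch, the two scenarios above differ only in the `waitSeconds` value (both values are illustrative):

```json
// Tech support: give callers room to finish their thoughts
"startSpeakingPlan": { "waitSeconds": 1.2 }

// Customer service: respond as soon as the customer finishes
"startSpeakingPlan": { "waitSeconds": 0.0 }
```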

- **Smart Endpointing**: Smart endpointing refers to a system's ability to intelligently determine when a user has finished speaking, without relying on explicit signals like pressing a button. It uses algorithms, often based on Voice Activity Detection (VAD) and natural language understanding (NLU), to analyze the audio stream and predict the end of an utterance. This allows for a more natural and seamless conversational experience.

An advanced feature that uses AI to detect when the user has truly finished speaking, especially useful for mid-thought pauses. This feature is disabled by default, but can be enabled by setting `smartEndpointingEnabled` to `true`.
```json
"startSpeakingPlan": {
  "smartEndpointingEnabled": false // Disabled by default
}
```
Here's an example of a `startSpeakingPlan` configuration:
```json
"startSpeakingPlan": {
  "waitSeconds": 0.5, // Time to wait before speaking (in seconds)
  "smartEndpointingEnabled": true, // AI-based detection of the end of an utterance
  "customEndpointingRules": [
    {
      "type": "both",
      "assistantRegex": "do you want a loan", // Rule applies when the assistant's speech matches
      "customerRegex": "yes|no", // ...and the customer's reply matches
      "timeoutSeconds": 1.1 // Wait 1.1 seconds before responding
    }
  ],
  "transcriptionEndpointingPlan": {
    "onPunctuationSeconds": 0.2, // Time after punctuation to end (seconds)
    "onNoPunctuationSeconds": 1.8, // Time without punctuation to end (seconds)
    "onNumberSeconds": 0.7 // Time after a number to end (seconds)
  }
}
```
This configuration tells the assistant to:

- Wait 0.5 seconds after the user stops speaking before starting.
- Use smart endpointing to detect the end of user utterances.
- Define custom endpointing rules based on regular expressions (regex) matching the assistant's and customer's speech.
- Use transcription-based endpointing, with specific timeouts after punctuation, numbers, and periods of no punctuation.
**Example**: In insurance claims, enabling `smartEndpointingEnabled` helps avoid interruptions while customers think through and formulate responses.
### Stop Speaking Plan

- **Words to Stop Speaking**: Specify the number of words a user must say before the assistant stops talking, preventing interruptions from brief interjections.
- **Voice Activity Detection**: Set the duration of user speech required to trigger the assistant to stop speaking, minimizing overlaps.
- **Pause Before Resuming**: Control the delay before the assistant resumes speaking after being interrupted, ensuring a natural conversational flow.

The `stopSpeakingPlan` allows you to configure how the assistant stops speaking, preventing interruptions and ensuring a smooth conversation. Here's an example:

```json
"stopSpeakingPlan": {
  "numWords": 3, // Number of words the user must say before the assistant stops
  "voiceSeconds": 0.8, // Duration of user speech that stops the assistant (seconds)
  "backoffSeconds": 0.3 // Pause before resuming after an interruption (seconds)
}
```
This configuration instructs the assistant to:

- Stop speaking if the user says 3 or more words.
- Stop speaking if the user's voice activity is detected for 0.8 seconds or longer.
- Pause for 0.3 seconds before resuming speech after being interrupted.
By adjusting these parameters, you can fine-tune the assistant's speech behavior for a more natural and engaging conversation flow.
Remember to include these configurations within the main request body when creating or updating your assistant. Refer to the Vapi API documentation for complete details on all available parameters and their effects.
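
For instance, a minimal sketch of a request body that sets both plans together might look like this (values are illustrative):

```json
{
  "startSpeakingPlan": {
    "waitSeconds": 0.5,
    "smartEndpointingEnabled": true
  },
  "stopSpeakingPlan": {
    "numWords": 3,
    "voiceSeconds": 0.8,
    "backoffSeconds": 0.3
  }
}
```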
## Behavioral Settings

- **First Message Mode**: Determine whether the assistant initiates the conversation or waits for user input. Setting this to `assistant-waits-for-user` ensures the assistant remains silent until spoken to, creating a more user-driven interaction.
- **Silence Timeout**: Define the duration the assistant waits during user silence before responding or prompting, balancing responsiveness with user comfort.
- **Max Duration**: Set limits on interaction lengths to manage session times effectively. This parameter helps prevent overly long interactions that may lead to user fatigue or disengagement.
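
A minimal sketch of these settings in the request body; `assistant-waits-for-user` is named above, while the two timeout field names are assumptions to verify against the API reference:

```json
{
  "firstMessageMode": "assistant-waits-for-user", // Assistant stays silent until the user speaks
  "silenceTimeoutSeconds": 30, // Assumed field name; how long to tolerate silence
  "maxDurationSeconds": 600 // Assumed field name; cap on total interaction length
}
```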
## Best Practices

- **Adapt to User Style**: Configure settings based on conversational dynamics, such as enabling smart endpointing for mid-thought pauses.
- **Minimize Noise Interference**: Adjust parameters to handle noisy environments effectively.
- **Optimize Conversational Flow**: Balance responsiveness and non-intrusiveness by testing different configurations.
- **Tailor for Use Cases**: Customize settings for specific scenarios, such as tech support or healthcare applications.
- **Iterate and Improve**: Continuously test configurations with real users and refine based on feedback.