VL TN initial commit

PatrickFarley · PatrickFarley · commit 341e02681613 · 2025-09-29T17:56:18.000-04:00
diff --git a/articles/ai-foundry/responsible-ai/speech-service/voice-live/transparency-note.md b/articles/ai-foundry/responsible-ai/speech-service/voice-live/transparency-note.md
@@ -0,0 +1,116 @@
+---
+title: Use cases for Voice live
+titleSuffix: Azure AI services
+description: This Transparency Note discusses Voice live and the key considerations for making use of this technology responsibly.
+author: PatrickFarley
+ms.author: pafarley
+manager: nitinme
+ms.service: azure-ai-speech
+ms.topic: article
+ms.date: 09/29/2025
+---
+
+# Use cases for Voice live
+
+[!INCLUDE [Non-English translation disclaimer](./includes/non-english-translation.md)]
+
+## What is a Transparency Note? 
+
+An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, what its capabilities and limitations are, and how to achieve the best performance. Microsoft’s Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system, or share them with the people who will use or be affected by your system.
+
+Microsoft’s Transparency Notes are part of a broader effort at Microsoft to put our AI Principles into practice. To find out more, see the [Microsoft AI principles](https://www.microsoft.com/ai/responsible-ai).
+
+## The basics of Voice Live API 
+
+### Introduction
+
+The Voice Live API enables developers to build low-latency speech-to-speech experiences. It supports text and audio modalities for both input and output. The API is composed of multiple AI systems, including language models (both large and small), speech to text models, text to speech models, and more. Developers can build conversational experiences to power scenarios including, but not limited to, customer support, education and learning, automotive assistants, and voice-based public services. The API is fully managed and orchestrated, allowing developers to build their end-user experiences without managing underlying models, compute, and bespoke integration of multiple individual components.  
+
+## Key terms 
+
+|Term | Definition |
+|--|--|
+|Transcription | The text output of the speech to text feature. This automatically generated text output uses speech models and is sometimes referred to as machine transcription or automated speech recognition (ASR). Transcription in this context is fully automated, meaning it is generated by the model and, therefore, is different from human transcription, which is text that is generated by human transcribers. |
+|Automatic Speech Recognition (ASR) | Also known as speech to text (STT), ASR is the process whereby a model transcribes or processes human speech as audio into text. |
+|Text to Speech (TTS) |Also known as speech synthesis, TTS is the process whereby a model converts written text into speech audio. |
+|Text to Speech Avatar | Text to Speech Avatar allows developers to input text and create a synthetic video of an avatar speaking, synchronized with audio output from TTS. |
+|Token | Voice Live API processes audio and text by breaking it down into tokens. Tokens can be words or chunks of characters. |
+|Language model | Pretrained or fine-tuned generative AI models that can understand and generate natural language and code. |
+
+## Capabilities
+
+### System behavior
+
+Voice Live API provides developers with choices on multiple dimensions for achieving low-latency speech-to-speech experiences:
+- **Choice of language model**: Developers can choose from a list of different natively supported language models like GPT-Realtime, GPT-5, GPT-4.1, GPT-4o and GPT-4o-mini; incorporate an agent they have built using the Azure AI Foundry Agent Service to give the agent speech-in and speech-out capabilities; or bring their own model of choice deployed in Azure AI Foundry.  
+- **Choice of audio input processing**: Developers can choose between audio input being processed natively through multimodal language models like GPT-Realtime or processed through Azure AI Speech’s speech-to-text capabilities.  
+- **Choice of audio output processing**: Developers can choose between audio output being generated natively through multimodal language models like GPT-Realtime or generated through Azure AI Speech’s text-to-speech capabilities.
+
+For any combination of these choices, the API also lets the developer enable conversational enhancement capabilities such as start-of-speech and end-of-speech detection, background noise suppression, echo cancellation, and more.
+
+Developers can configure the API with the set of parameters that is most suitable for their scenarios. Text, audio, or both can then be provided as input. The input is processed by the developer’s choice of language model, audio input processing mechanism, and audio output processing mechanism to produce text and/or audio output.
+
+### Use cases 
+
+#### Intended uses
+
+Voice Live API can be used in multiple scenarios where real-time, speech-to-speech experiences are provided to end-users. The system’s intended uses include:
+- **Customer experience**: voice agents for customer support and shopping assistance. The goal of this intended use is to assist end users who have questions about a merchant’s products or services. For example, a customer who wants to know when their package will be delivered can ask a voice agent, "What is the shipping status of my order?" The voice agent would query the tools that it has access to, gather the necessary information, and then respond in audio, "Your order is en route and will be delivered within the next 48 hours."
+- **Automotive**: in-car voice assistant for command & control, general Q&A, etc. The goal of this intended use is to deliver hands-free functionality to drivers, allowing them to toggle various features of their vehicle, get help with navigating to a destination, etc. For example, a user who wants to turn down the temperature in their car can tell the in-car voice assistant, "Set the temperature to 68 degrees Fahrenheit." The in-car voice assistant would then invoke the appropriate tool to interface with the vehicle’s control systems and adjust the temperature. It could then respond to the user in audio, "I’ve set the temperature to 68 degrees. Let me know if there’s anything else I can help with."
+- **Learning/education**: voice-enabled learning companions and training assistants. The goal of this intended use is to deliver an interactive assistant that can help end users learn new concepts across any discipline. For example, a user who wants to practice counting up to ten can ask the learning companion, "Can you help me practice counting numbers from one to ten?" The learning companion could then respond in audio, "Sure! Start by saying the first three numbers and I’ll coach you in case you need any help or make any mistakes."
+
+#### Considerations when choosing other use cases
+
+We encourage developers to leverage Voice Live API in their innovative solutions or applications. However, here are some considerations when choosing a use case:
+- **Avoid scenarios in which the use or misuse of the system could have a consequential impact on life opportunities or legal status**: Examples include but are not limited to scenarios in which the AI system could affect an individual's legal status, legal rights, or their access to credit, education, employment, healthcare, housing, insurance, social welfare benefits, services, opportunities, or the terms on which these items are available. 
+- **Carefully consider all use cases in high-stakes domains or industries**: Examples include but are not limited to healthcare, education, finance, and legal. 
+- **Legal and regulatory considerations**: Organizations need to evaluate potential specific legal and regulatory obligations when using any AI services and solutions, which may not be appropriate for use in every industry or scenario. Restrictions may vary based on regional or local regulatory requirements. Additionally, AI services or solutions are not designed for and may not be used in ways prohibited in applicable terms of service and relevant codes of conduct.
+
+## Limitations
+
+When it comes to natural language models and speech models, there are fairness and responsible AI issues to consider. People use language to describe the world and to express their beliefs, assumptions, attitudes, and values. As a result, publicly available text and speech data typically used to train natural language processing and speech recognition models contain societal biases relating to race, gender, religion, age, and other groups of people, as well as other undesirable content. For example, these societal biases are reflected in the distributions of words, phrases, and syntactic structures. Speech models can also exhibit varying levels of accuracy across different demographic groups and languages.
+
+### Technical limitations, operational factors, and ranges
+
+Natural language and speech models trained with such data can potentially behave in ways that are unfair, unreliable, or offensive, in turn potentially causing harms. These models have a variety of risks, such as the ability to stereotype, demean, overrepresent or underrepresent different populations, among others. You can find more details about such risks in the [Azure OpenAI Transparency Note](/legal/cognitive-services/openai/transparency-note?tabs=speech). These risks are not mutually exclusive, and a single model can exhibit more than one type of harm, potentially relating to multiple different groups of people. In addition, speech-to-speech experiences, including Voice Live API, may introduce an additional risk of producing potentially inappropriate or offensive content as detailed below. Users should be aware of this risk, as well as the risks that natural language and speech models have overall.
+- **Inappropriate or offensive content**: Language models, including those supported by the Voice Live API, have the potential to produce other types of inappropriate or offensive content. Examples include text that is inappropriate in the context of the text prompt, audio output with accented speech that may be perceived as offensive in the context, and output with a mismatched tone, such as an excited tone in a neutral or somber context.
+
+Find more details about the technical limitations of Azure Speech options in the [Azure Speech to Text Transparency Note](/azure/ai-foundry/responsible-ai/speech-service/speech-to-text/transparency-note#limitations) and [Azure Text to Speech Transparency Note](/azure/ai-foundry/responsible-ai/speech-service/text-to-speech/transparency-note?tabs=prebuilt-voice#limitations).
+
+If you choose to bring your own Foundry Agent to Voice Live, learn the [limitations of the Azure AI Foundry Agent Service](/azure/ai-foundry/responsible-ai/agents/transparency-note#limitations).   
+
+
+## System performance
+
+In many AI systems, performance is defined in relation to accuracy, that is, how often the AI system offers a correct prediction or output. With natural language models and speech models, two different users might look at the same output and have different opinions of how useful or relevant it is, which means that performance for these systems must be defined more flexibly. Learn more about the performance of Azure OpenAI models and the best practices in the [Azure OpenAI Transparency Note](/azure/ai-foundry/responsible-ai/openai/transparency-note?tabs=text#system-performance).
+
+To learn the best practices for improving speech input and output processing, go to the [Azure Speech to Text Transparency Note](/azure/ai-foundry/responsible-ai/speech-service/speech-to-text/transparency-note#limitations) and [Azure Text to Speech Transparency Note](/azure/ai-foundry/responsible-ai/speech-service/text-to-speech/transparency-note?tabs=prebuilt-voice#limitations).
+
+## Evaluation of Voice Live API
+
+### Evaluating each component
+
+Each component of Voice Live API can be evaluated separately. Learn more about [Evaluation of speech to text](/azure/ai-foundry/responsible-ai/speech-service/speech-to-text/transparency-note#evaluation-of-speech-to-text), [Evaluation of text to speech](/azure/ai-foundry/responsible-ai/speech-service/text-to-speech/transparency-note?tabs=prebuilt-voice#evaluation-of-text-to-speech), and [Evaluation of Azure OpenAI models](/azure/ai-foundry/responsible-ai/openai/transparency-note?tabs=text#evaluating-and-integrating-azure-openai-natural-language-and-vision-models-for-your-use).  
+
+If you choose to bring your own Foundry Agent to Voice Live, learn about the [evaluation of the Azure AI Foundry Agent Service](/azure/ai-foundry/responsible-ai/agents/transparency-note#evaluating-and-integrating-azure-ai-agent-service-for-your-use).
+
+### Evaluating and integrating Voice Live API for your use 
+
+- **Robust ground truth data**: In general, in natural language models, developers should carefully select and pre-process their data to ensure that it is relevant, diverse, and balanced for the intended task and domain. Developers should also check and correct any errors or inconsistencies in the data, such as spelling, grammar, or formatting, to improve the data quality and readability. 
+    Specifically for language model evaluation, the accuracy of the ground truth data provided by the developer is crucial because inaccurate ground truth data leads to meaningless and inaccurate evaluation results. Ensuring the quality and reliability of this data is essential for obtaining valid assessments of the model's performance. Therefore, developers must carefully curate and verify their ground truth data to ensure that the evaluation process accurately reflects the model's true performance. This is particularly important when making decisions about deploying the model in real-world applications. 
+- **Prompt definition for evaluation**: The prompt developers use in their evaluation should match the prompt they plan to use in production. These prompts provide the instructions for the model to follow. Similar to the OpenAI playground, developers can create multiple inputs to include few-shot examples in their prompt. Refer to [Prompt engineering techniques](/en-us/azure/ai-services/openai/concepts/prompt-engineering?tabs=chat) for more details on some advanced techniques in prompt design and prompt engineering.
+- **Diverse metrics**: Use a combination of metrics to capture different aspects of performance such as accuracy, fluency and relevance. 
+- **Human-in-the-loop**: Integrate human feedback alongside automated evaluation to ensure that subjective nuances are accurately captured. For example, when evaluating the quality of audio output, human feedback can help ensure that the tone, speed, intonation, and other subjective metrics are sufficiently accounted for. 
+- **Transparency**: Clearly communicate the evaluation criteria to users, enabling them to understand how decisions are made. 
+- **Continual evaluation and testing**: Continually evaluate the model's performance to identify and address any regressions or negative user experience. 
+
+## Learn more about responsible AI 
+
+- [Microsoft AI principles](/ai/responsible-ai) 
+- [Microsoft responsible AI resources](/ai/responsible-ai-resources) 
+- [Microsoft Azure Learning courses on responsible AI](/learn/paths/responsible-ai-business-principles/) 
+
+## Learn more about Voice Live API 
+
+- [Voice live API overview](/azure/ai-services/speech-service/voice-live) 
+- [How to use the voice live API](/azure/ai-services/speech-service/voice-live-how-to)