
Commit 0d057c6

Learn Build Service GitHub App authored and committed
Merging changes synced from https://github.com/MicrosoftDocs/learn-pr (branch live)
2 parents 1589b2b + 959ada5 commit 0d057c6

28 files changed (+794, -0 lines)
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### YamlMime:ModuleUnit
uid: learn.wwl.develop-speech-agent-speech-mcp.introduction
title: Introduction
metadata:
  title: Introduction
  description: Introduction to developing a speech agent with the Azure Speech MCP server.
  author: ivorb
  ms.author: berryivor
  ms.date: 03/13/2026
  ms.topic: unit
  ai-usage: ai-generated
durationInMinutes: 2
content: |
  [!include[](includes/01-introduction.md)]
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### YamlMime:ModuleUnit
uid: learn.wwl.develop-speech-agent-speech-mcp.understand-speech-mcp
title: Understand the Azure Speech MCP server
metadata:
  title: Understand the Azure Speech MCP server
  description: Learn about the Model Context Protocol and the speech capabilities exposed by the Azure Speech MCP server.
  author: ivorb
  ms.author: berryivor
  ms.date: 03/13/2026
  ms.topic: unit
  ai-usage: ai-generated
durationInMinutes: 7
content: |
  [!include[](includes/02-understand-speech-mcp.md)]
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### YamlMime:ModuleUnit
uid: learn.wwl.develop-speech-agent-speech-mcp.connect-use-speech-mcp
title: Connect and use the Speech MCP server with an agent
metadata:
  title: Connect and use the Speech MCP server with an agent
  description: Learn how to connect the Azure Speech MCP server to an agent in Microsoft Foundry and build a client application.
  author: ivorb
  ms.author: berryivor
  ms.date: 03/13/2026
  ms.topic: unit
  ai-usage: ai-generated
durationInMinutes: 8
content: |
  [!include[](includes/03-connect-use-speech-mcp.md)]
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### YamlMime:ModuleUnit
uid: learn.wwl.develop-speech-agent-speech-mcp.exercise
title: Exercise - Use Azure Speech in an agent
metadata:
  title: Exercise - Use Azure Speech in an agent
  description: Use the Azure Speech MCP server to create an AI agent that performs speech-to-text and text-to-speech tasks.
  author: ivorb
  ms.author: berryivor
  ms.date: 03/13/2026
  ms.topic: unit
  ai-usage: ai-generated
durationInMinutes: 30
content: |
  [!include[](includes/04-exercise.md)]
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
### YamlMime:ModuleUnit
uid: learn.wwl.develop-speech-agent-speech-mcp.knowledge-check
title: Knowledge check
metadata:
  title: Knowledge check
  description: Check your understanding of the Azure Speech MCP server and agent integration.
  author: ivorb
  ms.author: berryivor
  ms.date: 03/13/2026
  ms.topic: unit
  ai-usage: ai-generated
durationInMinutes: 3
quiz:
  questions:
  - content: "What two core capabilities does the Azure Speech MCP server expose to agents?"
    choices:
    - content: "Language translation and text summarization."
      isCorrect: false
      explanation: "Incorrect. The Azure Speech MCP server provides speech-to-text and text-to-speech, not translation or summarization."
    - content: "Speech-to-text recognition and text-to-speech synthesis."
      isCorrect: true
      explanation: "Correct. The Azure Speech MCP server exposes speech-to-text (recognize) and text-to-speech (synthesize) as MCP tools."
    - content: "Named entity recognition and sentiment analysis."
      isCorrect: false
      explanation: "Incorrect. Those are Azure Language capabilities, not Azure Speech capabilities."
  - content: "Why does the Azure Speech MCP server require an Azure Storage account?"
    choices:
    - content: "To store the agent's instructions and configuration settings."
      isCorrect: false
      explanation: "Incorrect. The storage account is used for audio files, not agent configuration."
    - content: "To store input audio files and output audio files generated by the speech tools."
      isCorrect: true
      explanation: "Correct. The Speech MCP server uses Azure Blob Storage to receive audio files for transcription and to save generated audio from text-to-speech."
    - content: "To cache the MCP server's tool definitions for faster discovery."
      isCorrect: false
      explanation: "Incorrect. The storage account is used for audio file input and output, not caching tool definitions."
  - content: "What credentials are needed when connecting the Azure Speech MCP server to a Foundry agent?"
    choices:
    - content: "An OAuth 2.0 token and a managed identity endpoint URL."
      isCorrect: false
      explanation: "Incorrect. The connection requires a resource key and a blob container SAS URL."
    - content: "A Foundry resource key and a SAS URL for a blob container."
      isCorrect: true
      explanation: "Correct. You provide the resource key in the Ocp-Apim-Subscription-Key field and a SAS URL for the blob container in the X-Blob-Container-Url field."
    - content: "A client certificate and the Azure subscription ID."
      isCorrect: false
      explanation: "Incorrect. The connection uses key-based authentication with a resource key and a SAS URL."
  - content: "How can you specify a particular voice when using the text-to-speech tool through the agent?"
    choices:
    - content: "By configuring the voice in the MCP server settings before connecting."
      isCorrect: false
      explanation: "Incorrect. You specify the voice in your natural language prompt to the agent."
    - content: "By including the voice name in your natural language prompt to the agent."
      isCorrect: true
      explanation: "Correct. You can specify a voice such as en-GB-SoniaNeural directly in your prompt, and the agent passes it to the text-to-speech tool."
    - content: "By setting an environment variable in the client application code."
      isCorrect: false
      explanation: "Incorrect. Voice selection is specified in the prompt, not through environment variables."
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### YamlMime:ModuleUnit
uid: learn.wwl.develop-speech-agent-speech-mcp.summary
title: Summary
metadata:
  title: Summary
  description: Summary of developing a speech agent with the Azure Speech MCP server.
  author: ivorb
  ms.author: berryivor
  ms.date: 03/13/2026
  ms.topic: unit
  ai-usage: ai-generated
durationInMinutes: 2
content: |
  [!include[](includes/06-summary.md)]
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
Azure Speech in Foundry Tools provides speech-to-text and text-to-speech capabilities that you can integrate into AI applications. These capabilities let you transcribe audio to text and synthesize natural-sounding speech from text.

While you can call these capabilities directly through the Speech SDK or REST APIs, you can also make them available to an AI agent through the **Azure Speech Model Context Protocol (MCP) server**. This approach lets the agent handle speech tasks based on a user's natural language request, without you needing to write specific code for each speech operation.

For example, suppose you work for a company that needs to process customer support calls. Your team needs to transcribe recorded calls to text for analysis and generate audio responses that can be played back to customers. Rather than building separate integrations for transcription and synthesis, you can create an AI agent that uses the Azure Speech MCP server to perform both tasks through a single tool connection.

In this module, you learn how the Azure Speech MCP server works, how to connect it to an AI agent in Microsoft Foundry, and how to build a client application that interacts with the agent programmatically.

> [!NOTE]
> The Azure Speech MCP server is currently in public preview. Details described in this module are subject to change.
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
The Azure Speech MCP server connects AI agents to Azure Speech in Foundry Tools through the **Model Context Protocol (MCP)**. Before exploring the Speech MCP server itself, it helps to understand what MCP is and how it enables agents to use external tools.

## What is the Model Context Protocol?

The Model Context Protocol (MCP) is an open protocol that defines how AI agents interact with external tools, data sources, and services. MCP uses a client-server architecture with the following components:

- **Host**: The application that runs the agent (such as Microsoft Foundry or a custom app).
- **Client**: A component within the host that manages connections to MCP servers and handles communication.
- **Server**: A program that exposes tools, resources, and prompts that an agent can discover and call.

When an agent connects to an MCP server, it receives a catalog of available tools along with descriptions of what each tool does. The agent can then choose the right tool based on the user's request. This approach is called *dynamic tool discovery*: the agent doesn't need hardcoded knowledge of each tool. Instead, it queries the MCP server at runtime to find out what's available.
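In MCP, the client retrieves this catalog with a `tools/list` request. The following is a hypothetical sketch of what a speech server's catalog might look like; the tool names and input schemas here are illustrative assumptions, not the actual Azure Speech MCP contract:

```python
# Hypothetical MCP tools/list response for a speech server.
# Tool names and schemas are illustrative, not the real Azure contract.
tool_catalog = {
    "tools": [
        {
            "name": "recognize",
            "description": "Transcribe an audio file to text (speech-to-text).",
            "inputSchema": {
                "type": "object",
                "properties": {"audio_url": {"type": "string"}},
                "required": ["audio_url"],
            },
        },
        {
            "name": "synthesize",
            "description": "Generate an audio file from text (text-to-speech).",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "voice": {"type": "string"},
                },
                "required": ["text"],
            },
        },
    ]
}

# An agent matches the user's request against these names and descriptions
# at runtime, so the catalog can change without redeploying the agent.
names = [tool["name"] for tool in tool_catalog["tools"]]
print(names)
```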
The key advantage of MCP for AI agents is flexibility. Tools can be added, updated, or removed on the server without modifying the agent itself. The agent always has access to the latest tool definitions, which makes MCP-based solutions easier to maintain and scale.

> [!TIP]
> To learn more about MCP architecture and how to build custom MCP tool integrations, see the **[Integrate MCP Tools with Azure AI Agents](/training/modules/connect-agent-to-mcp-tools/)** module.

## Azure Speech MCP server capabilities

The Azure Speech MCP server exposes two core speech capabilities as tools that any MCP-compatible agent can call:

| Capability | Description |
|---|---|
| **Speech-to-text (Recognize)** | Converts audio files to text using advanced speech recognition. Supports WAV, MP3, OGG, FLAC, MP4, M4A, AAC, and other common audio formats. Includes options for language selection, phrase hints for improved accuracy, profanity filtering, and detailed or simple output formats. |
| **Text-to-speech (Synthesize)** | Converts text input into natural-sounding audio files using neural text-to-speech voices. Supports multiple languages and voices (for example, `en-US-JennyNeural` or `en-GB-SoniaNeural`), and generates output in WAV, MP3, or other formats. |

When you connect the Speech MCP server to an agent, the agent receives the available speech tools and their descriptions. Based on the user's prompt, the agent decides which tool to call. For example, if a user says "Transcribe this audio file," the agent calls the speech-to-text tool. If the user says "Generate speech from this text," the agent calls the text-to-speech tool.

## How the agent selects tools

The tool selection process works as follows:

1. The user sends a prompt to the agent.
1. The agent analyzes the prompt and determines which speech task needs to be performed.
1. The agent checks the available MCP tools and their descriptions to find the best match.
1. The agent calls the selected tool through the MCP server, passing the relevant input (audio file URL or text).
1. The MCP server processes the request using Azure Speech and returns the results (transcribed text or a link to an audio file).
1. The agent presents the results to the user in a natural language response.

The agent handles tool selection autonomously, so you don't need to write routing logic to determine whether a prompt requires speech-to-text or text-to-speech.
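In a real agent the model performs this matching with natural-language reasoning, but the overall flow can be caricatured as a simple dispatcher. This is a toy sketch of the selection step only, not how the model actually decides:

```python
# Toy sketch of the routing an agent performs implicitly.
# A real agent uses the model's reasoning over tool descriptions,
# not keyword matching.
TOOLS = {
    "recognize": "Transcribe an audio file to text (speech-to-text).",
    "synthesize": "Generate an audio file from text (text-to-speech).",
}

def select_tool(prompt: str) -> str:
    """Pick the tool whose purpose best matches the user's prompt."""
    text = prompt.lower()
    if "transcribe" in text or "speech-to-text" in text:
        return "recognize"
    # Anything else that reached a speech agent is treated as synthesis.
    return "synthesize"

print(select_tool("Transcribe the file at https://example.com/call.wav"))  # recognize
print(select_tool('Generate "Hello" as speech'))  # synthesize
```

The point of the sketch is what you *don't* write in practice: the agent derives this routing from the tool descriptions in the MCP catalog.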
## Storage requirements

Unlike text-only MCP tools, the Azure Speech MCP server works with audio files, which requires an **Azure Storage account**.

- **Text-to-speech**: The Speech MCP server saves generated audio files to an Azure Blob Storage container. The agent's response includes a link to the generated audio file.
- **Speech-to-text**: The agent can transcribe audio files from a publicly accessible URL or from an Azure Blob Storage container accessed with a SAS URL.

When you connect the Speech MCP server to your agent, you provide a **SAS URL** for a blob container. The SAS URL grants the MCP server permission to read and write files in that container.

> [!IMPORTANT]
> Treat SAS URLs as secrets. Use the shortest practical expiry time, scope them to a single container, and don't embed them in source code, agent prompts, or chat transcripts.

## Prerequisites

To use the Azure Speech MCP server with an agent, you need:

- An **Azure subscription**.
- A **Foundry resource and project**. You need the Contributor or Owner role on the resource group. Your Foundry resource includes speech capabilities.
- An **Azure Storage account** with a blob container for storing audio files.
- A **SAS URL** for the blob container with read, write, add, create, and list permissions.

## Security considerations

The Azure Speech MCP server uses key-based authentication. When you create the connection, you provide your resource key and a blob container SAS URL. Follow these best practices:

- Store keys and SAS URLs in a secure secret store and rotate them regularly.
- Avoid embedding keys or SAS URLs directly in source code, scripts, or documentation.
- Use the shortest practical SAS expiry time and scope it to the minimum required resource.
- Rotate keys immediately if you suspect they're exposed.
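One lightweight way to keep keys and SAS URLs out of source code is to read them from the environment (populated from a secret store) at startup and fail fast when they're missing. A minimal sketch; the variable names `FOUNDRY_KEY` and `BLOB_CONTAINER_SAS_URL` are illustrative assumptions, not names the service requires:

```python
import os

def load_secret(name: str) -> str:
    """Read a required secret from the environment; fail fast if missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required secret: {name}")
    return value

# Hypothetical variable names; use whatever your deployment defines.
# resource_key = load_secret("FOUNDRY_KEY")
# sas_url = load_secret("BLOB_CONTAINER_SAS_URL")
```

Failing fast at startup surfaces a missing or rotated secret immediately, instead of as an authentication error mid-conversation.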
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
After you understand the capabilities of the Azure Speech MCP server, the next step is to connect it to an agent and start using it. This involves setting up storage, creating an agent in Microsoft Foundry, connecting the Speech MCP tool, testing it in the agent playground, and optionally building a client application.

## Set up Azure Blob Storage

The Azure Speech MCP server requires an Azure Storage account to store audio files. You need to create a storage account and a blob container before connecting the tool.

1. In the [Azure portal](https://portal.azure.com), create a new **Azure Storage account** (or use an existing one).
1. In the storage account, expand **Data storage** and select **Containers**.
1. Create a new container (for example, named **files**) to store the audio files your agent generates and reads.
1. Generate a **SAS token** for the container with the following permissions: Read, Add, Create, Write, and List. Set the expiry time to the shortest practical duration.

> [!IMPORTANT]
> Copy the generated SAS URL and store it securely. You need it when connecting the Speech MCP server.
## Create a Foundry project and agent

To use the Azure Speech MCP server, you need a Microsoft Foundry project with a deployed model.

1. In the [Microsoft Foundry portal](https://ai.azure.com), create a new project (or use an existing one).
1. Deploy a model (such as **gpt-4.1**) that your agent will use for reasoning and generating responses.
1. Create an agent and give it instructions that describe its purpose. For example:

   ```
   You are an AI agent that uses the Azure AI Speech tool to transcribe and generate speech.
   ```

The agent is now ready to receive tool connections.

## Connect the Azure Speech MCP server

You connect the Azure Speech MCP server to your agent through the **Tools** page in the Foundry portal.

1. In the navigation pane, select the **Tools** page.
1. Select **Connect a tool** and choose **Azure Speech in Foundry Tools** from the catalog.
1. Configure the connection with the following settings:
   - **Foundry resource name**: The name of your Foundry resource (for example, `myproject-resource`).
   - **Bearer** (`Ocp-Apim-Subscription-Key`): The key for your Foundry project.
   - **X-Blob-Container-Url**: The SAS URL for your blob container.
1. Wait for the connection to be created, then select **Use in an agent** and choose your agent.

:::image type="content" source="../media/azure-speech-tool-config.png" alt-text="Screenshot of the Tools catalog in the Foundry portal showing the Azure Speech in Foundry Tools connection configuration.":::

The agent now has access to the speech-to-text and text-to-speech tools exposed by the Azure Speech MCP server.

> [!TIP]
> You can find the project key on the project home page in the Foundry portal.

## Test in the agent playground

The agent playground in the Foundry portal provides an interactive environment for testing your agent.

### Test text-to-speech

Enter a prompt that asks the agent to generate speech:

```
Generate "To be or not to be, that is the question." as speech
```

The first time the agent uses the Speech MCP tool, you're prompted to **approve** the tool usage. You can select **Always approve all Azure Speech MCP Server tools** to skip future approval prompts.

The response includes a link to the generated audio file saved in your blob container. Select the link to listen to the synthesized speech.

### Test speech-to-text

Enter a prompt that asks the agent to transcribe an audio file. You can use a publicly accessible URL or a SAS URL pointing to a file in your blob container:

```
Transcribe the file at https://example.com/audio/meeting-recording.wav
```

The agent calls the speech-to-text tool and returns the transcribed text.

### Customize speech output

The Speech MCP tools support several options you can specify in your prompts:

- **Voice selection**: Specify a neural voice, such as `en-GB-SoniaNeural` or `en-US-JennyNeural`.
- **Language**: Specify the language for recognition or synthesis (for example, `es-ES` for Spanish).
- **Phrase hints**: Provide domain-specific terms to improve transcription accuracy (for example, "Azure, OpenAI, Cognitive Services").
- **Profanity filtering**: Request `masked`, `removed`, or `raw` profanity handling during transcription.

For example:

```
Synthesize "Better a witty fool, than a foolish wit!" as speech using the voice "en-GB-SoniaNeural".
```

## Build a client application

While the agent playground is useful for testing, you typically want to build a client application that uses the agent programmatically. The Microsoft Foundry SDK supports this through the OpenAI Responses API.

To build a client application, you use the `azure-ai-projects` and `azure-identity` packages. The general pattern is:

1. Create an `AIProjectClient` using your Foundry project endpoint and `DefaultAzureCredential` (which uses your Azure CLI credentials in development).
1. Get an OpenAI client from the project client by calling `get_openai_client()`.
1. Call `responses.create()` to send a user prompt to the agent.
The key part is how you reference the agent: you specify it by name in the `extra_body` parameter:

```python
response = openai_client.responses.create(
    input=[{"role": "user", "content": user_prompt}],
    extra_body={
        "agent_reference": {
            "name": "Speech-Agent",
            "type": "agent_reference"
        }
    },
)

print(response.output_text)
```

The agent processes the prompt, calls the appropriate Speech MCP tool, and returns the result in `output_text`. For text-to-speech requests, the output includes a link to the generated audio file in your blob container.

### Connect the MCP server in code

Instead of connecting the Azure Speech MCP server through the Foundry portal, you can define the MCP tool connection directly in code when you create an agent. Use the `MCPTool` class from the `azure-ai-projects` SDK:

```python
from azure.ai.projects.models import MCPTool

mcp_tool = MCPTool(
    server_label="azure-speech",
    server_url="https://{foundry-resource-name}.cognitiveservices.azure.com/speech/mcp",
    require_approval="always",
)
```

You then pass `mcp_tool` when creating the agent through the SDK. This approach is useful when you want to manage tool connections as part of your application code rather than configuring them manually in the portal.
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
Now it's your turn to build a speech agent using the Azure Speech MCP server!

In this exercise, you create an AI agent in Microsoft Foundry, connect it to the Azure Speech MCP server, test text-to-speech and speech-to-text capabilities in the agent playground, and build a Python client application that interacts with the agent.

> [!NOTE]
> To complete this exercise, you need an **[Azure subscription](https://azure.microsoft.com/pricing/purchase-options/azure-account?cid=msft_learn)** in which you have administrative access.

Launch the exercise and follow the instructions.

[![Button to launch exercise.](../media/launch-exercise.png)](https://go.microsoft.com/fwlink/?linkid=2356519&azure-portal=true)
