Commit cab2a06

Merge pull request #50321 from MicrosoftDocs/NEW-develop-generative-ai-audio-apps

Modules/M09-develop-generative-ai-audio-apps

2 parents 0fc35ea + a87ca84

File tree

16 files changed: +234 lines added, 0 lines deleted

learn-pr/paths/develop-language-solutions-azure-ai/index.yml

Lines changed: 1 addition & 0 deletions

```diff
@@ -34,5 +34,6 @@ modules:
 - learn.wwl.translate-text-with-translator-service
 - learn.wwl.create-speech-enabled-apps
 - learn.wwl.translate-speech-speech-service
+- learn.wwl.develop-generative-ai-audio-apps
 trophy:
   uid: learn.wwl.develop-language-solutions-azure-ai.trophy
```

Lines changed: 13 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.introduction
title: Introduction
metadata:
  title: Introduction
  description: "Get started with audio-enabled generative AI models."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 1
content: |
  [!include[](includes/1-introduction.md)]
```

Lines changed: 13 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.deploy-multimodal-models
title: Deploy a multimodal model
metadata:
  title: Deploy a multimodal model
  description: "Deploy a multimodal model that can respond to audio-based prompts."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 3
content: |
  [!include[](includes/2-deploy-multimodal-model.md)]
```

Lines changed: 13 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.develop-audio-chat-apps
title: Develop an audio-based chat app
metadata:
  title: Develop an audio-based chat app
  description: "Use Azure AI Foundry, Azure AI Model Inference, and Azure OpenAI SDKs to develop an audio-based chat app."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 5
content: |
  [!include[](includes/3-develop-audio-chat-app.md)]
```

Lines changed: 13 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.exercise
title: Exercise - Develop an audio-enabled chat app
metadata:
  title: Exercise - Develop an audio-enabled chat app
  description: "Get practical experience deploying a multimodal model and creating an audio-enabled chat app."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 30
content: |
  [!include[](includes/4-exercise.md)]
```

Lines changed: 48 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.knowledge-check
title: Module assessment
metadata:
  title: Module assessment
  description: "Check your learning on audio-enabled generative AI."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 3
content: |
quiz:
  questions:
  - content: "Which kind of model can you use to respond to audio input?"
    choices:
    - content: "Only OpenAI GPT models"
      isCorrect: false
      explanation: "Incorrect."
    - content: "Embedding models"
      isCorrect: false
      explanation: "Incorrect."
    - content: "Multimodal models"
      isCorrect: true
      explanation: "Correct."
  - content: "How can you submit a prompt that asks a model to analyze an audio file?"
    choices:
    - content: "Submit one prompt with an audio-based message followed by another prompt with a text-based message."
      isCorrect: false
      explanation: "Incorrect."
    - content: "Submit a prompt that contains a multi-part user message, containing both text content and audio content."
      isCorrect: true
      explanation: "Correct."
    - content: "Submit the audio file as the system message and the instruction or question as the user message."
      isCorrect: false
      explanation: "Incorrect."
  - content: "How can you include audio in a message?"
    choices:
    - content: "As a URL or as binary data"
      isCorrect: true
      explanation: "Correct."
    - content: "Only as a URL"
      isCorrect: false
      explanation: "Incorrect."
    - content: "Only as binary data"
      isCorrect: false
      explanation: "Incorrect."
```

Lines changed: 13 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.summary
title: Summary
metadata:
  title: Summary
  description: "Reflect on what you've learned about audio-enabled generative AI models."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 1
content: |
  [!include[](includes/6-summary.md)]
```

Lines changed: 3 additions & 0 deletions

Generative AI models make it possible to build intelligent chat-based applications that can understand and reason over input. Traditionally, text input is the primary mode of interaction with AI models, but multimodal models are increasingly becoming available. These models make it possible for chat applications to respond to audio input as well as text.

In this module, we'll discuss audio-enabled generative AI and explore how you can use Azure AI Foundry to create generative AI solutions that respond to prompts that include a mix of text and audio data.

Lines changed: 16 additions & 0 deletions

To handle prompts that include audio, you need to deploy a *multimodal* generative AI model - in other words, a model that supports not only text-based input, but audio-based input as well. Multimodal models available in Azure AI Foundry include (among others):

- Microsoft **Phi-4-multimodal-instruct**
- OpenAI **gpt-4o**
- OpenAI **gpt-4o-mini**

> [!TIP]
> To learn more about available models in Azure AI Foundry, see the **[Model catalog and collections in Azure AI Foundry portal](/azure/ai-foundry/how-to/model-catalog-overview)** article in the Azure AI Foundry documentation.

## Testing multimodal models with audio-based prompts

After deploying a multimodal model, you can test it in the chat playground in the Azure AI Foundry portal. Some models allow you to include audio attachments in the playground, either by uploading a file or recording a message.

![Screenshot of the chat playground with an audio-based prompt.](../media/audio-prompt.png)

In the chat playground, you can upload a local audio file and add text to the message to elicit a response from a multimodal model.
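
The portal is the simplest way to deploy a model, but deployment can also be scripted. As a rough sketch only (not a step in this module), the following Python example uses the **azure-mgmt-cognitiveservices** package to deploy **gpt-4o** to an Azure AI Services resource; the subscription, resource group, resource name, model version, and SKU values are illustrative assumptions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment,
    DeploymentModel,
    DeploymentProperties,
    Sku,
)

# Hypothetical subscription and resource names - replace with your own.
client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# Create (or update) a deployment of the gpt-4o multimodal model.
poller = client.deployments.begin_create_or_update(
    resource_group_name="<resource-group>",
    account_name="<ai-services-resource>",
    deployment_name="gpt-4o",
    deployment=Deployment(
        properties=DeploymentProperties(
            model=DeploymentModel(format="OpenAI", name="gpt-4o", version="2024-11-20"),
        ),
        sku=Sku(name="GlobalStandard", capacity=1),
    ),
)

print(f"Deployed: {poller.result().name}")
```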

Lines changed: 47 additions & 0 deletions

To develop a client app that engages in audio-based chats with a multimodal model, you can use the same basic techniques used for text-based chats. You require a connection to the endpoint where the model is deployed, and you use that endpoint to submit prompts that consist of messages to the model and process the responses.

The key difference is that prompts for an audio-based chat include multi-part user messages that contain both a *text* content item and an *audio* content item.

![Diagram of a multi-part prompt being submitted to a model.](../media/multi-part-prompt.png)

The JSON representation of a prompt that includes a multi-part user message looks something like this:

```json
{
    "messages": [
        { "role": "system", "content": "You are a helpful assistant." },
        { "role": "user", "content": [
            {
                "type": "text",
                "text": "Transcribe this audio:"
            },
            {
                "type": "audio_url",
                "audio_url": {
                    "url": "https://....."
                }
            }
        ] }
    ]
}
```

The audio content item can be either of the following:

- A URL to an audio file on a website.
- Binary audio data.

When using binary data to submit a local audio file, the **audio_url** content takes the form of a base64-encoded value in a data URL format:

```json
{
    "type": "audio_url",
    "audio_url": {
        "url": "data:audio/mp3;base64,<binary_audio_data>"
    }
}
```

Depending on the model type and where you deployed it, you can use the Microsoft Azure AI Model Inference API or the OpenAI API to submit audio-based prompts. Both APIs offer language-specific SDKs that abstract the underlying REST APIs.

In the exercise that follows in this module, you can use the Python or .NET SDK for the Azure AI Model Inference API and the OpenAI API to develop an audio-enabled chat application.
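
As a preview of that exercise, here's a minimal Python sketch using the **azure-ai-inference** package; the endpoint, key, audio file name, and model name are placeholder assumptions, not values from this module:

```python
import base64

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key for a model deployed in Azure AI Foundry.
client = ChatCompletionsClient(
    endpoint="https://<your-resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential("<your-key>"),
)

# Encode a local audio file as a base64 data URL.
with open("speech.mp3", "rb") as f:  # hypothetical file name
    audio_data = base64.b64encode(f.read()).decode("utf-8")

# Submit a prompt containing a multi-part user message: text plus audio.
response = client.complete(
    model="Phi-4-multimodal-instruct",  # placeholder model deployment name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": "Transcribe this audio:"},
            {
                "type": "audio_url",
                "audio_url": {"url": f"data:audio/mp3;base64,{audio_data}"},
            },
        ]},
    ],
)

print(response.choices[0].message.content)
```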
