Commit cab2a06

Merge pull request #50321 from MicrosoftDocs/NEW-develop-generative-ai-audio-apps

Modules/M09-develop-generative-ai-audio-apps

2 parents 0fc35ea + a87ca84

File tree

16 files changed: +234 lines added, 0 lines deleted

learn-pr/paths/develop-language-solutions-azure-ai/index.yml

Lines changed: 1 addition & 0 deletions

```diff
@@ -34,5 +34,6 @@ modules:
 - learn.wwl.translate-text-with-translator-service
 - learn.wwl.create-speech-enabled-apps
 - learn.wwl.translate-speech-speech-service
+- learn.wwl.develop-generative-ai-audio-apps
 trophy:
   uid: learn.wwl.develop-language-solutions-azure-ai.trophy
```

Lines changed: 13 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.introduction
title: Introduction
metadata:
  title: Introduction
  description: "Get started with audio-enabled generative AI models."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 1
content: |
  [!include[](includes/1-introduction.md)]
```

Lines changed: 13 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.deploy-multimodal-models
title: Deploy a multimodal model
metadata:
  title: Deploy a multimodal model
  description: "Deploy a multimodal model that can respond to audio-based prompts."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 3
content: |
  [!include[](includes/2-deploy-multimodal-model.md)]
```

Lines changed: 13 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.develop-audio-chat-apps
title: Develop an audio-based chat app
metadata:
  title: Develop an audio-based chat app
  description: "Use Azure AI Foundry, Azure AI Model Inference, and Azure OpenAI SDKs to develop an audio-based chat app."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 5
content: |
  [!include[](includes/3-develop-audio-chat-app.md)]
```

Lines changed: 13 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.exercise
title: Exercise - Develop an audio-enabled chat app
metadata:
  title: Exercise - Develop an audio-enabled chat app
  description: "Get practical experience deploying a multimodal model and creating an audio-enabled chat app."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 30
content: |
  [!include[](includes/4-exercise.md)]
```

Lines changed: 48 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.knowledge-check
title: Module assessment
metadata:
  title: Module assessment
  description: "Check your learning on audio-enabled generative AI."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 3
content: |
quiz:
  questions:
  - content: "Which kind of model can you use to respond to audio input?"
    choices:
    - content: "Only OpenAI GPT models"
      isCorrect: false
      explanation: "Incorrect."
    - content: "Embedding models"
      isCorrect: false
      explanation: "Incorrect."
    - content: "Multimodal models"
      isCorrect: true
      explanation: "Correct."
  - content: "How can you submit a prompt that asks a model to analyze an audio file?"
    choices:
    - content: "Submit one prompt with an audio-based message followed by another prompt with a text-based message."
      isCorrect: false
      explanation: "Incorrect."
    - content: "Submit a prompt that contains a multi-part user message, containing both text content and audio content."
      isCorrect: true
      explanation: "Correct."
    - content: "Submit the audio file as the system message and the instruction or question as the user message."
      isCorrect: false
      explanation: "Incorrect."
  - content: "How can you include audio in a message?"
    choices:
    - content: "As a URL or as binary data"
      isCorrect: true
      explanation: "Correct."
    - content: "Only as a URL"
      isCorrect: false
      explanation: "Incorrect."
    - content: "Only as binary data"
      isCorrect: false
      explanation: "Incorrect."
```

Lines changed: 13 additions & 0 deletions

```yml
### YamlMime:ModuleUnit
uid: learn.wwl.develop-generative-ai-audio-apps.summary
title: Summary
metadata:
  title: Summary
  description: "Reflect on what you've learned about audio-enabled generative AI models."
  ms.date: 05/06/2025
  author: buzahid
  ms.author: buzahid
  ms.topic: unit
durationInMinutes: 1
content: |
  [!include[](includes/6-summary.md)]
```

Lines changed: 3 additions & 0 deletions

Generative AI models make it possible to build intelligent chat-based applications that can understand and reason over input. Traditionally, text input is the primary mode of interaction with AI models, but multimodal models are increasingly becoming available. These models make it possible for chat applications to respond to audio input as well as text.

In this module, we'll discuss audio-enabled generative AI and explore how you can use Azure AI Foundry to create generative AI solutions that respond to prompts that include a mix of text and audio data.

Lines changed: 16 additions & 0 deletions

To handle prompts that include audio, you need to deploy a *multimodal* generative AI model - in other words, a model that supports not only text-based input, but audio-based input as well. Multimodal models available in Azure AI Foundry include (among others):

- Microsoft **Phi-4-multimodal-instruct**
- OpenAI **gpt-4o**
- OpenAI **gpt-4o-mini**

> [!TIP]
> To learn more about available models in Azure AI Foundry, see the **[Model catalog and collections in Azure AI Foundry portal](/azure/ai-foundry/how-to/model-catalog-overview)** article in the Azure AI Foundry documentation.

## Testing multimodal models with audio-based prompts

After deploying a multimodal model, you can test it in the chat playground in the Azure AI Foundry portal. Some models allow you to include audio attachments in the playground, either by uploading a file or recording a message.

![Screenshot of the chat playground with an audio-based prompt.](../media/audio-prompt.png)

In the chat playground, you can upload a local audio file and add text to the message to elicit a response from a multimodal model.
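
The portal is the simplest way to deploy a model, but deployment can also be scripted. As a rough sketch only (not a step in this module), the following Python example uses the **azure-mgmt-cognitiveservices** package to deploy **gpt-4o** to an Azure AI Services resource; the subscription, resource group, resource name, model version, and SKU values are illustrative assumptions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment,
    DeploymentModel,
    DeploymentProperties,
    Sku,
)

# Hypothetical subscription and resource names - replace with your own.
client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# Create (or update) a deployment of the gpt-4o multimodal model.
poller = client.deployments.begin_create_or_update(
    resource_group_name="<resource-group>",
    account_name="<ai-services-resource>",
    deployment_name="gpt-4o",
    deployment=Deployment(
        properties=DeploymentProperties(
            model=DeploymentModel(format="OpenAI", name="gpt-4o", version="2024-11-20"),
        ),
        sku=Sku(name="GlobalStandard", capacity=1),
    ),
)

print(f"Deployed: {poller.result().name}")
```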

Lines changed: 47 additions & 0 deletions

To develop a client app that engages in audio-based chats with a multimodal model, you can use the same basic techniques used for text-based chats. You require a connection to the endpoint where the model is deployed, and you use that endpoint to submit prompts that consist of messages to the model and process the responses.

The key difference is that prompts for an audio-based chat include multi-part user messages that contain both a *text* content item and an *audio* content item.

![Diagram of a multi-part prompt being submitted to a model.](../media/multi-part-prompt.png)

The JSON representation of a prompt that includes a multi-part user message looks something like this:

```json
{
    "messages": [
        { "role": "system", "content": "You are a helpful assistant." },
        { "role": "user", "content": [
            {
                "type": "text",
                "text": "Transcribe this audio:"
            },
            {
                "type": "audio_url",
                "audio_url": {
                    "url": "https://....."
                }
            }
        ] }
    ]
}
```

The audio content item can be either of the following:

- A URL to an audio file on a website.
- Binary audio data.

When using binary data to submit a local audio file, the **audio_url** content takes the form of a base64-encoded value in a data URL format:

```json
{
    "type": "audio_url",
    "audio_url": {
        "url": "data:audio/mp3;base64,<binary_audio_data>"
    }
}
```

Depending on the model type and where you deployed it, you can use the Microsoft Azure AI Model Inference API or the OpenAI API to submit audio-based prompts. Both APIs offer language-specific SDKs that abstract the underlying REST APIs.

In the exercise that follows in this module, you can use the Python or .NET SDK for the Azure AI Model Inference API and the OpenAI API to develop an audio-enabled chat application.
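
As a preview of that exercise, here's a minimal Python sketch using the **azure-ai-inference** package; the endpoint, key, audio file name, and model name are placeholder assumptions, not values from this module:

```python
import base64

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key for a model deployed in Azure AI Foundry.
client = ChatCompletionsClient(
    endpoint="https://<your-resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential("<your-key>"),
)

# Encode a local audio file as a base64 data URL.
with open("speech.mp3", "rb") as f:  # hypothetical file name
    audio_data = base64.b64encode(f.read()).decode("utf-8")

# Submit a prompt containing a multi-part user message: text plus audio.
response = client.complete(
    model="Phi-4-multimodal-instruct",  # placeholder model deployment name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": "Transcribe this audio:"},
            {
                "type": "audio_url",
                "audio_url": {"url": f"data:audio/mp3;base64,{audio_data}"},
            },
        ]},
    ],
)

print(response.choices[0].message.content)
```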
