
Commit 593bb4f

Merge pull request #3531 from santiagxf/santiagxf/audio-js
Audio support for JS
2 parents d72eda4 + b0e2a8c commit 593bb4f

5 files changed: +259 -19 lines changed

articles/ai-foundry/model-inference/includes/how-to-prerequisites-javascript.md

Lines changed: 10 additions & 0 deletions
@@ -12,3 +12,13 @@ author: santiagxf
```bash
npm install @azure-rest/ai-inference
```

* Import the following modules:

```javascript
import ModelClient, { isUnexpected } from "@azure-rest/ai-inference";
import { AzureKeyCredential } from "@azure/core-auth";
import { DefaultAzureCredential } from "@azure/identity";
import { createRestError } from "@azure-rest/core-client";
```

articles/ai-foundry/model-inference/includes/use-chat-multi-modal/csharp.md

Lines changed: 16 additions & 16 deletions
@@ -62,9 +62,6 @@ client = new ChatCompletionsClient(
Some models can reason across text and images and generate text completions based on both kinds of input. In this section, you explore the capabilities of some models for vision in a chat fashion:

- > [!IMPORTANT]
- > Some models support only one image for each turn in the chat conversation and only the last image is retained in context. If you add multiple images, it results in an error.

To see this capability, download an image and encode the information as a `base64` string. The resulting data should be inside a [data URL](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs):

@@ -126,6 +123,9 @@ Usage:
Images are broken into tokens and submitted to the model for processing. When referring to images, each of those tokens is typically referred to as a *patch*. Each model may break down a given image into a different number of patches. Read the model card to learn the details.

+ > [!IMPORTANT]
+ > Some models support only one image for each turn in the chat conversation and only the last image is retained in context. If you add multiple images, it results in an error.

## Use chat completions with audio

Some models can reason across text and audio inputs. The following example shows how you can send audio context to chat completions models that also support audio. Use `InputAudio` to load the content of the audio file into the payload. The content is encoded in `base64` data and sent over the payload.
@@ -157,12 +157,12 @@ Console.WriteLine($"\tCompletion tokens: {response.Value.Usage.CompletionTokens}
```

```console
- ASSISTANT: The chart illustrates that larger models tend to perform better in quality, as indicated by their size in billions of parameters. However, there are exceptions to this trend, such as Phi-3-medium and Phi-3-small, which outperform smaller models in quality. This suggests that while larger models generally have an advantage, there might be other factors at play that influence a model's performance.
- Model: Phi-4-multimodal-instruct
- Usage:
- Prompt tokens: 2380
- Completion tokens: 126
- Total tokens: 2506
+ ASSISTANT: Hola. ¿Cómo estás?
+ Model: speech
+ Usage:
+ Prompt tokens: 77
+ Completion tokens: 7
+ Total tokens: 84
```

The model can read the content from an **accessible cloud location** by passing the URL as an input. The C# SDK doesn't provide a direct way to do it, but you can indicate the payload as follows:
@@ -194,12 +194,12 @@ Console.WriteLine($"\tCompletion tokens: {response.Value.Usage.CompletionTokens}
```

```console
- ASSISTANT: The chart illustrates that larger models tend to perform better in quality, as indicated by their size in billions of parameters. However, there are exceptions to this trend, such as Phi-3-medium and Phi-3-small, which outperform smaller models in quality. This suggests that while larger models generally have an advantage, there might be other factors at play that influence a model's performance.
- Model: Phi-4-multimodal-instruct
- Usage:
- Prompt tokens: 2380
- Completion tokens: 126
- Total tokens: 2506
+ ASSISTANT: Hola. ¿Cómo estás?
+ Model: speech
+ Usage:
+ Prompt tokens: 77
+ Completion tokens: 7
+ Total tokens: 84
```

Audio is broken into tokens and submitted to the model for processing. Some models may operate directly over audio tokens while others may use internal modules to perform speech-to-text, resulting in different strategies to compute tokens. Read the model card for details about how each model operates.

articles/ai-foundry/model-inference/includes/use-chat-multi-modal/javascript.md

Lines changed: 228 additions & 2 deletions
@@ -14,5 +14,231 @@ ms.custom: references_regions, tool_generated
zone_pivot_groups: azure-ai-inference-samples
---

- > [!NOTE]
- > Using audio inputs is only supported using Python, C#, or REST requests.

[!INCLUDE [Feature preview](~/reusable-content/ce-skilling/azure/includes/ai-studio/includes/feature-preview.md)]

This article explains how to use the chat completions API with models deployed to Azure AI model inference in Azure AI services.

## Prerequisites

To use chat completion models in your application, you need:

[!INCLUDE [how-to-prerequisites](../how-to-prerequisites.md)]

[!INCLUDE [how-to-prerequisites-javascript](../how-to-prerequisites-javascript.md)]

* A chat completions model deployment with support for **audio and images**. If you don't have one, read [Add and configure models to Azure AI services](../../how-to/create-model-deployments.md) to add a chat completions model to your resource.

* This tutorial uses `Phi-4-multimodal-instruct`.

## Use chat completions

First, create the client to consume the model. The following code uses the endpoint URL of your resource and a key stored in an environment variable.

```javascript
const client = new ModelClient(
    "https://<resource>.services.ai.azure.com/models",
    new AzureKeyCredential(process.env.AZUREAI_ENDPOINT_KEY)
);
```

If you have configured the resource with **Microsoft Entra ID** support, you can use the following code snippet to create a client.

```javascript
const clientOptions = { credentials: { scopes: ["https://cognitiveservices.azure.com/.default"] } };

const client = new ModelClient(
    "https://<resource>.services.ai.azure.com/models",
    new DefaultAzureCredential(),
    clientOptions,
);
```

## Use chat completions with images

Some models can reason across text and images and generate text completions based on both kinds of input. In this section, you explore the capabilities of some models for vision in a chat fashion.

To see this capability, download an image and encode the information as a `base64` string. The resulting data should be inside a [data URL](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs):

```javascript
const image_url = "https://news.microsoft.com/source/wp-content/uploads/2024/04/The-Phi-3-small-language-models-with-big-potential-1-1900x1069.jpg";
const image_format = "jpeg";

const response = await fetch(image_url, { headers: { "User-Agent": "Mozilla/5.0" } });
const image_data = await response.arrayBuffer();
const image_data_base64 = Buffer.from(image_data).toString("base64");
const data_url = `data:image/${image_format};base64,${image_data_base64}`;
```

Visualize the image (this snippet assumes a browser environment where `document` is available):

```javascript
const img = document.createElement("img");
img.src = data_url;
document.body.appendChild(img);
```

:::image type="content" source="../../../../ai-foundry/media/how-to/sdks/small-language-models-chart-example.jpg" alt-text="A chart displaying the relative capabilities between large language models and small language models." lightbox="../../../../ai-foundry/media/how-to/sdks/small-language-models-chart-example.jpg":::
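If you're running the sample in Node.js rather than a browser, a minimal alternative (reusing the `image_data` and `image_format` variables from the download snippet above) is to write the image to disk and open it locally:

```javascript
import fs from "node:fs";

// Save the downloaded image next to the script so you can inspect it locally.
fs.writeFileSync(`chart.${image_format}`, Buffer.from(image_data));
console.log(`Saved chart.${image_format}`);
```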
Now, create a chat completion request with the image:

```javascript
var messages = [
    { role: "system", content: "You are a helpful assistant that can generate responses based on images." },
    { role: "user", content:
        [
            { type: "text", text: "Which conclusion can be extracted from the following chart?" },
            { type: "image_url", image_url:
                {
                    url: data_url
                }
            }
        ]
    }
];

var response = await client.path("/chat/completions").post({
    body: {
        messages: messages,
        model: "Phi-4-multimodal-instruct",
    }
});
```

The response is as follows, where you can see the model's usage statistics:

```javascript
console.log(response.body.choices[0].message.role + ": " + response.body.choices[0].message.content);
console.log("Model:", response.body.model);
console.log("Usage:");
console.log("\tPrompt tokens:", response.body.usage.prompt_tokens);
console.log("\tCompletion tokens:", response.body.usage.completion_tokens);
console.log("\tTotal tokens:", response.body.usage.total_tokens);
```

```console
ASSISTANT: The chart illustrates that larger models tend to perform better in quality, as indicated by their size in billions of parameters. However, there are exceptions to this trend, such as Phi-3-medium and Phi-3-small, which outperform smaller models in quality. This suggests that while larger models generally have an advantage, there might be other factors at play that influence a model's performance.
Model: Phi-4-multimodal-instruct
Usage:
Prompt tokens: 2380
Completion tokens: 126
Total tokens: 2506
```

Images are broken into tokens and submitted to the model for processing. When referring to images, each of those tokens is typically referred to as a *patch*. Each model may break down a given image into a different number of patches. Read the model card to learn the details.
> [!IMPORTANT]
> Some models support only one image for each turn in the chat conversation and only the last image is retained in context. If you add multiple images, it results in an error.
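As a rough illustration of how patches add up, the following sketch estimates a patch count for an image. The 448 × 448 patch size is an assumption for illustration only; each model documents its own tiling and tokenization scheme in its model card.

```javascript
// Hypothetical estimate only: assumes a model that tiles images into 448x448 patches.
function estimatePatches(widthPx, heightPx, patchSize = 448) {
    return Math.ceil(widthPx / patchSize) * Math.ceil(heightPx / patchSize);
}

console.log(estimatePatches(1900, 1069)); // 5 x 3 = 15 patches for the sample chart
```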
## Use chat completions with audio

Some models can reason across text and audio inputs. The following example shows how you can send audio context to chat completions models that also support audio.

In this example, we create a helper function `getAudioData` that reads an audio file from disk and returns its content encoded as `base64` data, which is the format the model expects.

```javascript
import fs from "node:fs";

/**
 * Get the Base64 data of an audio file.
 * @param {string} audioFile - The path to the audio file.
 * @returns {string} Base64 data of the audio.
 */
function getAudioData(audioFile) {
    try {
        const audioBuffer = fs.readFileSync(audioFile);
        return audioBuffer.toString("base64");
    } catch (error) {
        console.error(`Could not read '${audioFile}'.`);
        console.error("Set the correct path to the audio file before running this sample.");
        process.exit(1);
    }
}
```

Now use this function to load the content of an audio file stored on disk and send it in a user message. Notice that the request also indicates the format of the audio content:

```javascript
const audioFilePath = "hello_how_are_you.mp3";
const audioFormat = "mp3";
const audioData = getAudioData(audioFilePath);

const systemMessage = { role: "system", content: "You are an AI assistant for translating and transcribing audio clips." };
const audioMessage = {
    role: "user",
    content: [
        { type: "text", text: "Translate this audio snippet to Spanish." },
        { type: "input_audio",
            input_audio: {
                data: audioData,
                format: audioFormat,
            },
        },
    ],
};

const response = await client.path("/chat/completions").post({
    body: {
        messages: [
            systemMessage,
            audioMessage
        ],
        model: "Phi-4-multimodal-instruct",
    },
});
```

The response is as follows, where you can see the model's usage statistics:

```javascript
if (isUnexpected(response)) {
    throw response.body.error;
}

console.log(response.body.choices[0].message.role + ": " + response.body.choices[0].message.content);
console.log("Model:", response.body.model);
console.log("Usage:");
console.log("\tPrompt tokens:", response.body.usage.prompt_tokens);
console.log("\tCompletion tokens:", response.body.usage.completion_tokens);
console.log("\tTotal tokens:", response.body.usage.total_tokens);
```

```console
ASSISTANT: Hola. ¿Cómo estás?
Model: speech
Usage:
Prompt tokens: 77
Completion tokens: 7
Total tokens: 84
```
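The prerequisites also import `createRestError` from `@azure-rest/core-client`. As a minimal alternative to throwing `response.body.error` directly, you can use it to raise an error that carries the HTTP status code and response details:

```javascript
if (isUnexpected(response)) {
    // Build an error from the failed response, including status code and body.
    throw createRestError(response);
}
```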
The model can read the content from an **accessible cloud location** by passing the URL as an input. The JavaScript SDK doesn't provide a direct way to do it, but you can indicate the payload as follows:

```javascript
const systemMessage = { role: "system", content: "You are a helpful assistant." };
const audioMessage = {
    role: "user",
    content: [
        { type: "text", text: "Transcribe this audio." },
        { type: "audio_url",
            audio_url: {
                url: "https://example.com/audio.mp3",
            },
        },
    ],
};

const response = await client.path("/chat/completions").post({
    body: {
        messages: [
            systemMessage,
            audioMessage
        ],
        model: "Phi-4-multimodal-instruct",
    },
});
```

Audio is broken into tokens and submitted to the model for processing. Some models may operate directly over audio tokens while others may use internal modules to perform speech-to-text, resulting in different strategies to compute tokens. Read the model card for details about how each model operates.
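If the service returns a per-modality breakdown of the prompt tokens, you can log it defensively. The `prompt_tokens_details` field below is an assumption; not every model or API version returns it.

```javascript
// prompt_tokens_details is optional; only log it when the service returns it.
const details = response.body.usage?.prompt_tokens_details;
if (details?.audio_tokens !== undefined) {
    console.log("\tAudio tokens:", details.audio_tokens);
}
```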

articles/ai-foundry/model-inference/includes/use-chat-multi-modal/python.md

Lines changed: 2 additions & 0 deletions
@@ -213,3 +213,5 @@ response = client.complete(
    }
)
```

Audio is broken into tokens and submitted to the model for processing. Some models may operate directly over audio tokens while others may use internal modules to perform speech-to-text, resulting in different strategies to compute tokens. Read the model card for details about how each model operates.

articles/ai-foundry/model-inference/includes/use-chat-multi-modal/rest.md

Lines changed: 3 additions & 1 deletion
@@ -240,4 +240,6 @@ The response is as follows, where you can see the model's usage statistics:
        "total_tokens": 84
    }
}
```

Audio is broken into tokens and submitted to the model for processing. Some models may operate directly over audio tokens while others may use internal modules to perform speech-to-text, resulting in different strategies to compute tokens. Read the model card for details about how each model operates.
