articles/ai-foundry/model-inference/includes/use-chat-multi-modal/csharp.md (+16 −16)
@@ -62,9 +62,6 @@ client = new ChatCompletionsClient(
 Some models can reason across text and images and generate text completions based on both kinds of input. In this section, you explore the vision capabilities of some models in a chat fashion:
 
-> [!IMPORTANT]
-> Some models support only one image for each turn in the chat conversation and only the last image is retained in context. If you add multiple images, it results in an error.
-
 To see this capability, download an image and encode the information as a `base64` string. The resulting data should be inside a [data URL](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs):
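For orientation, producing that data URL needs only the standard library. A minimal Python sketch, assuming a hypothetical local file `sample.jpg` (the file name and `jpeg` format are illustrative, not taken from this diff):

```python
import base64
from pathlib import Path

# Read the image bytes and encode them as base64 text.
image_path = Path("sample.jpg")  # hypothetical local file
encoded = base64.b64encode(image_path.read_bytes()).decode("ascii")

# Wrap the encoded bytes in a data URL, the shape the article calls for.
data_url = f"data:image/jpeg;base64,{encoded}"
print(data_url[:60] + "...")
```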
@@ -126,6 +123,9 @@ Usage:
 Images are broken into tokens and submitted to the model for processing. When referring to images, each of those tokens is typically referred to as a *patch*. Each model may break down a given image into a different number of patches. Read the model card to learn the details.
 
+> [!IMPORTANT]
+> Some models support only one image for each turn in the chat conversation and only the last image is retained in context. If you add multiple images, it results in an error.
+
 ## Use chat completions with audio
 
 Some models can reason across text and audio inputs. The following example shows how you can send audio context to chat completions models that also support audio. Use `InputAudio` to load the content of the audio file into the payload. The content is encoded in `base64` data and sent over the payload.
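The *patches* paragraph added above leaves the per-image count model-dependent, so the practical way to see what an image costs is to inspect the returned usage. A hedged sketch with the `azure-ai-inference` Python package; the endpoint and credential environment variables, the file name, and the model name are placeholders:

```python
import os

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import (
    ImageContentItem,
    ImageUrl,
    TextContentItem,
    UserMessage,
)
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],  # placeholder
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]),
)

response = client.complete(
    messages=[
        UserMessage(content=[
            TextContentItem(text="Which conclusion can you draw from this chart?"),
            # ImageUrl.load base64-encodes the file into a data URL for you.
            ImageContentItem(image_url=ImageUrl.load(
                image_file="chart.jpg", image_format="jpeg",
            )),
        ]),
    ],
    model="Phi-4-multimodal-instruct",  # assumed deployment name
)

# Prompt tokens include the image's patches, so this number varies by model.
print("Prompt tokens:", response.usage.prompt_tokens)
```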
@@ -160,9 +160,9 @@
-ASSISTANT: The chart illustrates that larger models tend to perform better in quality, as indicated by their size in billions of parameters. However, there are exceptions to this trend, such as Phi-3-medium and Phi-3-small, which outperform smaller models in quality. This suggests that while larger models generally have an advantage, there might be other factors at play that influence a model's performance.
-Model: Phi-4-multimodal-instruct
-Usage:
-  Prompt tokens: 2380
-  Completion tokens: 126
-  Total tokens: 2506
+ASSISTANT: Hola. ¿Cómo estás?
+Model: speech
+Usage:
+  Prompt tokens: 77
+  Completion tokens: 7
+  Total tokens: 84
 ```
 
 The model can read the content from an **accessible cloud location** by passing the URL as an input. The Python SDK doesn't provide a direct way to do it, but you can indicate the payload as follows:
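The payload example that this sentence introduces is truncated in the diff view. For orientation only, a URL-based content part in raw dict form usually looks like the sketch below; the `audio_url` part type and the sample URL are assumptions, not taken from this diff (reusing `client` from the earlier sketch):

```python
# Hedged sketch: pass a cloud-hosted file by URL instead of base64 data.
response = client.complete(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio."},
            {"type": "audio_url",  # assumed content-part type
             "audio_url": {"url": "https://example.org/hello_how_are_you.mp3"}},
        ],
    }],
    model="Phi-4-multimodal-instruct",  # assumed deployment name
)
print(response.choices[0].message.content)
```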
articles/ai-foundry/model-inference/includes/use-chat-multi-modal/javascript.md (+4 −0)
@@ -127,6 +127,8 @@ Usage:
 Total tokens: 2506
 ```
 
+Images are broken into tokens and submitted to the model for processing. When referring to images, each of those tokens is typically referred to as a *patch*. Each model may break down a given image into a different number of patches. Read the model card to learn the details.
+
 > [!IMPORTANT]
 > Some models support only one image for each turn in the chat conversation and only the last image is retained in context. If you add multiple images, it results in an error.
@@ … @@
+
+Audio is broken into tokens and submitted to the model for processing. Some models may operate directly over audio tokens while others may use internal modules to perform speech-to-text, resulting in different strategies to compute tokens. Read the model card for details about how each model operates.
articles/ai-foundry/model-inference/includes/use-chat-multi-modal/python.md (+2 −0)
@@ -213,3 +213,5 @@ response = client.complete(
     }
 )
 ```
+
+Audio is broken into tokens and submitted to the model for processing. Some models may operate directly over audio tokens while others may use internal modules to perform speech-to-text, resulting in different strategies to compute tokens. Read the model card for details about how each model operates.
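To see where those audio tokens land, a hedged sketch using `InputAudio` from recent `azure-ai-inference` builds (the audio file and model name are placeholders, and these types may require a recent preview version of the package):

```python
from azure.ai.inference.models import (
    AudioContentItem,
    AudioContentFormat,
    InputAudio,
    TextContentItem,
    UserMessage,
)

response = client.complete(  # client constructed as in the earlier sketch
    messages=[
        UserMessage(content=[
            TextContentItem(text="Translate this audio to English."),
            # InputAudio.load base64-encodes the file into the request payload.
            AudioContentItem(input_audio=InputAudio.load(
                audio_file="hello_how_are_you.mp3",
                audio_format=AudioContentFormat.MP3,
            )),
        ]),
    ],
    model="Phi-4-multimodal-instruct",  # assumed deployment name
)

# The prompt/completion split depends on the model's tokenization strategy.
print("Prompt tokens:", response.usage.prompt_tokens)
print("Completion tokens:", response.usage.completion_tokens)
```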
articles/ai-foundry/model-inference/includes/use-chat-multi-modal/rest.md (+3 −1)
@@ -240,4 +240,6 @@ The response is as follows, where you can see the model's usage statistics:
         "total_tokens": 84
     }
 }
-```
+```
+
+Audio is broken into tokens and submitted to the model for processing. Some models may operate directly over audio tokens while others may use internal modules to perform speech-to-text, resulting in different strategies to compute tokens. Read the model card for details about how each model operates.
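For completeness, the same call can go straight to the REST route that this file documents. A sketch with Python's `requests`; the host, API version, and `audio_url` part shape are assumptions, so check the REST reference for the exact contract:

```python
import requests

url = (
    "https://<resource>.services.ai.azure.com/models/chat/completions"
    "?api-version=2024-05-01-preview"  # assumed API version
)
headers = {"Authorization": "Bearer <key>", "Content-Type": "application/json"}
payload = {
    "model": "Phi-4-multimodal-instruct",  # assumed deployment name
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Translate this audio to English."},
            {"type": "audio_url",  # assumed content-part type
             "audio_url": {"url": "https://example.org/hello_how_are_you.mp3"}},
        ],
    }],
}

body = requests.post(url, headers=headers, json=payload, timeout=60).json()
print(body["usage"])  # for example: {"prompt_tokens": 77, ..., "total_tokens": 84}
```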