
Commit 81ce892

Merge pull request #4404 from msakande/image-to-text-models
Draft for image models articles
2 parents ec96182 + a678720 commit 81ce892

5 files changed: +155 −2 lines changed
articles/ai-foundry/concepts/models-featured.md

Lines changed: 1 addition & 1 deletion
@@ -262,7 +262,7 @@ Mistral AI offers two categories of models, namely:
| [Mistral-Large-2411](https://ai.azure.com/explore/models/Mistral-Large-2411/version/2/registry/azureml-mistral) | [chat-completion](../model-inference/how-to/use-chat-completions.md?context=/azure/ai-foundry/context/context) | - **Input:** text (128,000 tokens) <br /> - **Output:** text (4,096 tokens) <br /> - **Tool calling:** Yes <br /> - **Response formats:** Text, JSON |
| [Mistral-large-2407](https://ai.azure.com/explore/models/Mistral-large-2407/version/1/registry/azureml-mistral) <br /> (deprecated) | [chat-completion](../model-inference/how-to/use-chat-completions.md?context=/azure/ai-foundry/context/context) | - **Input:** text (131,072 tokens) <br /> - **Output:** text (4,096 tokens) <br /> - **Tool calling:** Yes <br /> - **Response formats:** Text, JSON |
| [Mistral-large](https://ai.azure.com/explore/models/Mistral-large/version/1/registry/azureml-mistral) <br /> (deprecated) | [chat-completion](../model-inference/how-to/use-chat-completions.md?context=/azure/ai-foundry/context/context) | - **Input:** text (32,768 tokens) <br /> - **Output:** text (4,096 tokens) <br /> - **Tool calling:** Yes <br /> - **Response formats:** Text, JSON |
-| [Mistral-OCR-2503](https://aka.ms/aistudio/landing/mistral-ocr-2503) | image to text | - **Input:** image or PDF pages (1,000 pages, max 50MB PDF file) <br> - **Output:** text <br /> - **Tool calling:** No <br /> - **Response formats:** Text, JSON, Markdown |
+| [Mistral-OCR-2503](https://aka.ms/aistudio/landing/mistral-ocr-2503) | [image to text](../how-to/use-image-models.md) | - **Input:** image or PDF pages (1,000 pages, max 50MB PDF file) <br> - **Output:** text <br /> - **Tool calling:** No <br /> - **Response formats:** Text, JSON, Markdown |
| [Mistral-small-2503](https://aka.ms/aistudio/landing/mistral-small-2503) | [chat-completion (with images)](../model-inference/how-to/use-chat-multi-modal.md?context=/azure/ai-foundry/context/context) | - **Input:** text and images (131,072 tokens), <br> image-based tokens are 16px x 16px <br> blocks of the original images <br /> - **Output:** text (4,096 tokens) <br /> - **Tool calling:** Yes <br /> - **Response formats:** Text, JSON |
| [Mistral-small](https://ai.azure.com/explore/models/Mistral-small/version/1/registry/azureml-mistral) | [chat-completion](../model-inference/how-to/use-chat-completions.md?context=/azure/ai-foundry/context/context) | - **Input:** text (32,768 tokens) <br /> - **Output:** text (4,096 tokens) <br /> - **Tool calling:** Yes <br /> - **Response formats:** Text, JSON |

articles/ai-foundry/how-to/use-image-models.md

Lines changed: 149 additions & 0 deletions

@@ -0,0 +1,149 @@
---
title: How to use image-to-text models in the model catalog
titleSuffix: Azure AI Foundry
description: Learn how to use image-to-text models from the AI Foundry model catalog.
manager: scottpolly
author: msakande
reviewer: frogglew
ms.service: azure-ai-model-inference
ms.topic: how-to
ms.date: 05/02/2025
ms.author: mopeakande
ms.reviewer: frogglew
ms.custom: references_regions, tool_generated
---
# How to use image-to-text models in the model catalog

This article explains how to use _image-to-text_ models in the AI Foundry model catalog.

Image-to-text models are designed to analyze images and generate descriptive text based on what they see. Think of them as a combination of a camera and a writer. You provide an image as input, and the model identifies the different elements in it, such as objects, people, scenes, and even text. Based on this analysis, the model generates a written description that summarizes what it sees.

Image-to-text models excel at use cases such as accessibility features, content organization (tagging), product and educational visual descriptions, and digitizing content through optical character recognition (OCR). They bridge the gap between visual content and written language, making information more accessible and easier to process in various contexts.
## Prerequisites

To use image models in your application, you need:

- An Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work. If you don't have an Azure subscription, create a [paid Azure account](https://azure.microsoft.com/pricing/purchase-options/pay-as-you-go) to begin.

- An [Azure AI Foundry project](create-projects.md).

- An image model deployment on Azure AI Foundry. This article uses a __Mistral OCR__ model deployment.

- The endpoint URL and key.
## Use an image-to-text model

1. Authenticate using an API key. First, deploy the model to generate the endpoint URL and an API key to authenticate against the service. In this example, the endpoint and key are strings holding the endpoint URL and the API key. You can find both values on the **Deployments + Endpoint** page after you deploy the model.

    If you're using Bash:

    ```bash
    export AZURE_API_KEY="<your-api-key>"
    ```

    If you're using PowerShell:

    ```powershell
    $Env:AZURE_API_KEY = "<your-api-key>"
    ```

    If you're using the Windows command prompt:

    ```
    set AZURE_API_KEY=<your-api-key>
    ```
1. Run a basic code sample. Different image models accept different data formats. In this example, _Mistral OCR 25.03_ supports only base64-encoded data; document URLs and image URLs aren't supported. Paste the following code into a shell:

    ```bash
    curl --request POST \
      --url https://<your_serverless_endpoint>/v1/ocr \
      --header "Authorization: Bearer ${AZURE_API_KEY}" \
      --header 'Content-Type: application/json' \
      --data '{
        "model": "mistral-ocr-2503",
        "document": {
            "type": "document_url",
            "document_name": "test",
            "document_url": "data:application/pdf;base64,JVBER... <replace with your base64 encoded image data>"
        }
    }'
    ```
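The service returns JSON. If you save the response to a file (for example, by adding `-o ocr_output.json` to the curl call), you can extract just the recognized text with `jq`. The following is a minimal sketch that assumes the response follows Mistral's OCR response format, where each entry in the `pages` array carries a `markdown` field.

```bash
# Extract the recognized text (markdown) of every page from a saved response.
# Assumes the response JSON has a "pages" array whose entries carry a
# "markdown" field, as in Mistral's OCR response format.
jq -r '.pages[].markdown' ocr_output.json > ocr_output.md

# Quick look at the first page only.
jq -r '.pages[0].markdown' ocr_output.json | head -n 20
```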
## More code samples for Mistral OCR 25.03

To process PDF files (this sample assumes the endpoint URL and key are stored in the `AZURE_AI_CHAT_ENDPOINT` and `AZURE_AI_CHAT_KEY` environment variables):

```bash
# Read the PDF file and encode it as a base64 data URI.
# On Linux, use `base64 -w 0` so the encoded output isn't line-wrapped.
input_file_path="assets/2201.04234v3.pdf"
base64_value=$(base64 "$input_file_path")
input_base64_value="data:application/pdf;base64,${base64_value}"
# echo $input_base64_value

# Prepare the JSON payload
payload_body=$(cat <<EOF
{
    "model": "mistral-ocr-2503",
    "document": {
        "type": "document_url",
        "document_url": "$input_base64_value"
    },
    "include_image_base64": true
}
EOF
)

# Send the payload to the OCR endpoint and save the response.
echo "$payload_body" | curl ${AZURE_AI_CHAT_ENDPOINT}/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${AZURE_AI_CHAT_KEY}" \
  -d @- -o ocr_pdf_output.json
```
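The Mistral-OCR-2503 catalog entry lists a limit of 1,000 pages and a maximum PDF size of 50 MB, so it can help to validate a file before encoding it. The following is a minimal sketch of such a guard, reusing the `input_file_path` variable from the sample above.

```bash
# Optional guard: reject PDFs above the documented 50 MB limit before encoding.
max_bytes=$((50 * 1024 * 1024))
actual_bytes=$(wc -c < "$input_file_path")
if [ "$actual_bytes" -gt "$max_bytes" ]; then
    echo "$input_file_path is larger than 50 MB; split the PDF before sending it." >&2
    exit 1
fi
```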
To process an image file:

```bash
# Read the image file and encode it as a base64 data URI.
# On Linux, use `base64 -w 0` so the encoded output isn't line-wrapped.
input_file_path="assets/receipt.png"
base64_value=$(base64 "$input_file_path")
input_base64_value="data:image/png;base64,${base64_value}"
# echo $input_base64_value

# Prepare the JSON payload
payload_body=$(cat <<EOF
{
    "model": "mistral-ocr-2503",
    "document": {
        "type": "image_url",
        "image_url": "$input_base64_value"
    },
    "include_image_base64": true
}
EOF
)

# Process the base64 data with the OCR endpoint and save the response.
echo "$payload_body" | curl ${AZURE_AI_CHAT_ENDPOINT}/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${AZURE_AI_CHAT_KEY}" \
  -d @- -o ocr_png_output.json
```
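To digitize a whole folder of scans, the same pattern extends to a loop over the files. The following is a sketch under the same assumptions as the sample above (the `assets/` folder and the `AZURE_AI_CHAT_ENDPOINT` and `AZURE_AI_CHAT_KEY` variables are placeholders); it writes one JSON response file per image.

```bash
# Sketch: run OCR over every PNG in a folder, one request per file,
# reusing the payload shape from the preceding sample.
for input_file_path in assets/*.png; do
    base64_value=$(base64 "$input_file_path")
    input_base64_value="data:image/png;base64,${base64_value}"

    payload_body=$(cat <<EOF
{
    "model": "mistral-ocr-2503",
    "document": {
        "type": "image_url",
        "image_url": "$input_base64_value"
    }
}
EOF
)

    # One output file per input image, for example receipt_ocr.json.
    output_file="$(basename "$input_file_path" .png)_ocr.json"
    echo "$payload_body" | curl ${AZURE_AI_CHAT_ENDPOINT}/v1/ocr \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer ${AZURE_AI_CHAT_KEY}" \
        -d @- -o "$output_file"
done
```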
## Model-specific parameters

Some image-to-text models support only specific data formats. Mistral OCR 25.03, for example, requires base64-encoded image data for its `document_url` parameter. The following table lists the supported and unsupported data formats for image models in the model catalog.

| Model | Supported | Not supported |
| :---- | ----- | ----- |
| Mistral OCR 25.03 | Base64-encoded image data | Document URL, image URL |
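Because plain image URLs aren't accepted, a common workaround is to download the file first and build the base64 data URI yourself. The following is a minimal sketch; the URL and file names are placeholders.

```bash
# Download the image, then convert it to the base64 data URI that
# Mistral OCR 25.03 expects in place of a plain image URL.
image_url="https://example.com/receipt.png"   # placeholder URL
curl -sSL "$image_url" -o downloaded_image.png
base64_value=$(base64 downloaded_image.png)
input_base64_value="data:image/png;base64,${base64_value}"
# Use $input_base64_value as the "image_url" value in the request payload.
```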
## Related content

- [How to use image generation models on Azure OpenAI](../../ai-services/openai/how-to/dall-e.md)

articles/ai-foundry/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -130,6 +130,8 @@ items:
      href: ../ai-foundry/model-inference/how-to/use-chat-reasoning.md?context=/azure/ai-foundry/context/context
    - name: Work with multimodal models
      href: ../ai-foundry/model-inference/how-to/use-chat-multi-modal.md?context=/azure/ai-foundry/context/context
+   - name: Work with image models
+     href: how-to/use-image-models.md
    - name: Azure OpenAI and AI services
      items:
      - name: Use Azure OpenAI Service in Azure AI Foundry portal

articles/machine-learning/concept-models-featured.md

Lines changed: 1 addition & 1 deletion
@@ -266,7 +266,7 @@ Mistral AI offers two categories of models, namely:
| [Mistral-Large-2411](https://ai.azure.com/explore/models/Mistral-Large-2411/version/2/registry/azureml-mistral) | [chat-completion](../ai-foundry/model-inference/how-to/use-chat-completions.md?context=/azure/machine-learning/context/context) | - **Input:** text (128,000 tokens) <br /> - **Output:** text (4,096 tokens) <br /> - **Tool calling:** Yes <br /> - **Response formats:** Text, JSON |
| [Mistral-large-2407](https://ai.azure.com/explore/models/Mistral-large-2407/version/1/registry/azureml-mistral) <br /> (deprecated) | [chat-completion](../ai-foundry/model-inference/how-to/use-chat-completions.md?context=/azure/machine-learning/context/context) | - **Input:** text (131,072 tokens) <br /> - **Output:** text (4,096 tokens) <br /> - **Tool calling:** Yes <br /> - **Response formats:** Text, JSON |
| [Mistral-large](https://ai.azure.com/explore/models/Mistral-large/version/1/registry/azureml-mistral) <br /> (deprecated) | [chat-completion](../ai-foundry/model-inference/how-to/use-chat-completions.md?context=/azure/machine-learning/context/context) | - **Input:** text (32,768 tokens) <br /> - **Output:** text (4,096 tokens) <br /> - **Tool calling:** Yes <br /> - **Response formats:** Text, JSON |
-| [Mistral-OCR-2503](https://aka.ms/aistudio/landing/mistral-ocr-2503) | image to text | - **Input:** image or PDF pages (1,000 pages, max 50MB PDF file) <br> - **Output:** text <br /> - **Tool calling:** No <br /> - **Response formats:** Text, JSON, Markdown |
+| [Mistral-OCR-2503](https://aka.ms/aistudio/landing/mistral-ocr-2503) | [image to text](../ai-foundry/how-to/use-image-models.md?context=/azure/machine-learning/context/context) | - **Input:** image or PDF pages (1,000 pages, max 50MB PDF file) <br> - **Output:** text <br /> - **Tool calling:** No <br /> - **Response formats:** Text, JSON, Markdown |
| [Mistral-small-2503](https://aka.ms/aistudio/landing/mistral-small-2503) | [chat-completion (with images)](../ai-foundry/model-inference/how-to/use-chat-multi-modal.md?context=/azure/machine-learning/context/context) | - **Input:** text and images (131,072 tokens), <br> image-based tokens are 16px x 16px <br> blocks of the original images <br /> - **Output:** text (4,096 tokens) <br /> - **Tool calling:** Yes <br /> - **Response formats:** Text, JSON |
| [Mistral-small](https://ai.azure.com/explore/models/Mistral-small/version/1/registry/azureml-mistral) | [chat-completion](../ai-foundry/model-inference/how-to/use-chat-completions.md?context=/azure/machine-learning/context/context) | - **Input:** text (32,768 tokens) <br /> - **Output:** text (4,096 tokens) <br /> - **Tool calling:** Yes <br /> - **Response formats:** Text, JSON |

articles/machine-learning/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -623,6 +623,8 @@ items:
      href: ../ai-foundry/model-inference/how-to/use-chat-reasoning.md?context=/azure/machine-learning/context/context
    - name: Work with multimodal models
      href: ../ai-foundry/model-inference/how-to/use-chat-multi-modal.md?context=/azure/machine-learning/context/context
+   - name: Work with image models
+     href: ../ai-foundry/how-to/use-image-models.md?context=/azure/machine-learning/context/context
    - name: Built-in policy to allow specific models
      href: how-to-built-in-policy-model-deployment.md
    - name: Custom policy to allow specific models
