
Commit 4e6103e

Merge pull request #262851 from MicrosoftDocs/main
1/10 11:00 AM IST Publish
2 parents b421088 + 8458166 commit 4e6103e


58 files changed: +523 -192 lines changed
articles/ai-services/openai/concepts/gpt-with-vision.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
---
title: GPT-4 Turbo with Vision concepts
titleSuffix: Azure OpenAI
description: Learn about vision chats enabled by GPT-4 Turbo with Vision.
author: PatrickFarley
ms.author: pafarley
ms.service: azure-ai-openai
ms.topic: conceptual
ms.date: 01/02/2024
manager: nitinme
keywords:
---

# GPT-4 Turbo with Vision concepts

GPT-4 Turbo with Vision is a large multimodal model (LMM) developed by OpenAI that can analyze images and provide textual responses to questions about them. It incorporates both natural language processing and visual understanding. This guide provides details on the capabilities and limitations of GPT-4 Turbo with Vision.

To try out GPT-4 Turbo with Vision, see the [quickstart](/azure/ai-services/openai/gpt-v-quickstart).

## Chats with vision

The GPT-4 Turbo with Vision model answers general questions about what's present in the images or videos you upload.

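For example, here's a minimal sketch of an image chat call using the Python SDK. The endpoint and key environment variables, API version, deployment name, and image URL are placeholders, not values prescribed by this article; adjust them for your own resource.

```python
# Minimal sketch: ask GPT-4 Turbo with Vision a question about an image.
# Endpoint, key, API version, deployment name, and image URL are placeholders.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-12-01-preview",  # assumed preview API version
)

response = client.chat.completions.create(
    model="my-gpt4v-deployment",  # your deployment name, not the model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what's in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=500,  # set max_tokens, or the output may be cut off
)

print(response.choices[0].message.content)
```
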
## Enhancements

Enhancements let you incorporate other Azure AI services (such as Azure AI Vision) to add new functionality to the chat-with-vision experience.

**Object grounding**: Azure AI Vision complements GPT-4 Turbo with Vision's text response by identifying and locating salient objects in the input images. This lets the chat model give more accurate and detailed responses about the contents of the image.

:::image type="content" source="../media/concepts/gpt-v/object-grounding.png" alt-text="Screenshot of an image with object grounding applied. Objects have bounding boxes with labels.":::

:::image type="content" source="../media/concepts/gpt-v/object-grounding-response.png" alt-text="Screenshot of a chat response to an image prompt about an outfit. The response is an itemized list of clothing items seen in the image.":::

**Optical Character Recognition (OCR)**: Azure AI Vision complements GPT-4 Turbo with Vision by providing high-quality OCR results as supplementary information to the chat model. It allows the model to produce higher quality responses for images with dense text, transformed images, and numbers-heavy financial documents, and increases the variety of languages the model can recognize in text.

:::image type="content" source="../media/concepts/gpt-v/receipts.png" alt-text="Photo of several receipts.":::

:::image type="content" source="../media/concepts/gpt-v/ocr-response.png" alt-text="Screenshot of the JSON response of an OCR call.":::

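To give a sense of how these enhancements are requested, here's a rough sketch of a request with grounding and OCR turned on, based on the `enhancements` and `dataSources` objects described in the [how-to guide](../how-to/gpt-with-vision.md). The resource names, keys, request URL, and image URL are placeholders, and the exact endpoint, API version, and overall request shape are defined in the how-to guide, not here.

```python
# Rough sketch of a chat request with Vision enhancements (grounding + OCR).
# All URLs, resource names, and keys are placeholders; see the how-to guide
# for the exact request endpoint, API version, and reference request body.
import requests

chat_url = "https://YOUR_RESOURCE.openai.azure.com/openai/..."  # placeholder; see the how-to guide

body = {
    "enhancements": {
        "grounding": {"enabled": True},  # request the object detection/grounding service
        "ocr": {"enabled": True},        # request OCR results as supplementary input
    },
    "dataSources": [  # shown here as a list with one Computer Vision entry
        {
            "type": "AzureComputerVision",
            "parameters": {
                "endpoint": "https://YOUR_VISION_RESOURCE.cognitiveservices.azure.com/",
                "key": "YOUR_VISION_KEY",
            },
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What items are in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    "max_tokens": 500,  # set max_tokens, or the output may be cut off
}

response = requests.post(chat_url, headers={"api-key": "YOUR_AZURE_OPENAI_KEY"}, json=body)
print(response.json())
```
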
**Video prompt**: The **video prompt** enhancement lets you use video clips as input for AI chat, enabling the model to generate summaries and answers about video content. It uses Azure AI Vision Video Retrieval to sample a set of frames from a video and create a transcript of the speech in the video.

To use the video prompt enhancement, you need both an Azure AI Vision resource and an Azure Video Indexer resource, in addition to your Azure OpenAI resource.

> [!VIDEO https://www.microsoft.com/en-us/videoplayer/embed/RW1eHRf]

## Special pricing information

> [!IMPORTANT]
> Pricing details are subject to change in the future.

GPT-4 Turbo with Vision accrues charges like other Azure OpenAI chat models. You pay a per-token rate for the prompts and completions, detailed on the [Pricing page](/pricing/details/cognitive-services/openai-service/). The base charges and additional features are outlined here:

Base pricing for GPT-4 Turbo with Vision is:
- Input: $0.01 per 1000 tokens
- Output: $0.03 per 1000 tokens

See the [Tokens section of the overview](/azure/ai-services/openai/overview#tokens) for information on how text and images translate to tokens.

Additionally, if you use video prompt integration with the Video Retrieval add-on, it accrues other costs:
- Ingestion: $0.05 per minute of video
- Transactions: $0.25 per 1000 queries of the Video Retrieval index

Processing videos involves the use of extra tokens to identify key frames for analysis. The number of these additional tokens is roughly equivalent to the sum of the tokens in the text input, plus 700 tokens.

### Example price calculation

> [!IMPORTANT]
> The following content is an example only, and prices are subject to change in the future.

For a typical use case, take a 3-minute video with a 100-token prompt input. The video has a transcript that's 100 tokens long, and when the service processes the prompt, it generates 100 tokens of output. The pricing for this transaction would be:

| Item | Detail | Total cost |
|------|--------|------------|
| GPT-4 Turbo with Vision input tokens | 100 text tokens | $0.001 |
| Additional cost to identify frames | 100 input tokens + 700 tokens + 1 Video Retrieval transaction | $0.00825 |
| Image inputs and transcript input | 20 images (85 tokens each) + 100 transcript tokens | $0.018 |
| Output tokens | 100 tokens (assumed) | $0.003 |
| **Total cost** | | **$0.03025** |

Additionally, there's a one-time indexing cost of $0.15 to generate the Video Retrieval index for this 3-minute video. This index can be reused across any number of Video Retrieval and GPT-4 Turbo with Vision API calls.

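As a sanity check on the arithmetic, this short script reproduces the table above from the per-unit rates listed on this page (illustrative rates only; they're subject to change):

```python
# Reproduce the example price calculation above from the listed rates.
# Rates are the illustrative values on this page and are subject to change.
INPUT_RATE = 0.01 / 1000          # $ per input token
OUTPUT_RATE = 0.03 / 1000         # $ per output token
RETRIEVAL_TXN_RATE = 0.25 / 1000  # $ per Video Retrieval query

prompt_tokens = 100
frame_id_tokens = prompt_tokens + 700  # extra tokens used to identify key frames
image_tokens = 20 * 85                 # 20 sampled frames at 85 tokens each
transcript_tokens = 100
output_tokens = 100

input_cost = prompt_tokens * INPUT_RATE                         # $0.001
frame_cost = frame_id_tokens * INPUT_RATE + RETRIEVAL_TXN_RATE  # $0.00825
image_cost = (image_tokens + transcript_tokens) * INPUT_RATE    # $0.018
output_cost = output_tokens * OUTPUT_RATE                       # $0.003

total = input_cost + frame_cost + image_cost + output_cost
print(f"Total per-call cost: ${total:.5f}")  # $0.03025

# One-time indexing cost for the 3-minute video (ingestion at $0.05/minute):
print(f"Indexing cost: ${3 * 0.05:.2f}")  # $0.15
```
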
## Limitations

This section describes the limitations of GPT-4 Turbo with Vision.

### Image support

- **Limitation on image enhancements per chat session**: Enhancements cannot be applied to multiple images within a single chat call.
- **Maximum input image size**: The maximum size for input images is restricted to 20 MB.
- **Object grounding in enhancement API**: When the enhancement API is used for object grounding and the model detects duplicates of an object, it generates one bounding box and label for all the duplicates instead of separate ones for each.
- **Low resolution accuracy**: Analyzing images with the "low resolution" setting allows for faster responses and uses fewer input tokens for certain use cases, but it could affect the accuracy of object and text recognition within the image.
- **Image chat restriction**: When you upload images in Azure OpenAI Studio or the API, there is a limit of 10 images per chat call.

### Video support

- **Low resolution**: Video frames are analyzed using GPT-4 Turbo with Vision's "low resolution" setting, which may affect the accuracy of small object and text recognition in the video.
- **Video file limits**: Both MP4 and MOV file types are supported. In Azure OpenAI Studio, videos must be less than 3 minutes long. When you use the API, there is no such limit.
- **Prompt limits**: Video prompts can contain only one video and no images. In Azure OpenAI Studio, you can clear the session to try another video or images.
- **Limited frame selection**: The service selects 20 frames from the entire video, which might not capture all the critical moments or details. Frame selection can be approximately evenly spread through the video or focused by a specific video retrieval query, depending on the prompt.
- **Language support**: The service primarily supports English for grounding with transcripts. Transcripts don't provide accurate information on lyrics in songs.

## Next steps

- Get started using GPT-4 Turbo with Vision by following the [quickstart](/azure/ai-services/openai/gpt-v-quickstart).
- For a more in-depth look at the APIs, and to use video prompts in chat, follow the [how-to guide](../how-to/gpt-with-vision.md).
- See the [completions and embeddings API reference](../reference.md).

articles/ai-services/openai/faq.yml

Lines changed: 3 additions & 3 deletions
@@ -92,13 +92,13 @@ sections:
     To learn more about how GPT models are trained and work we recommend watching [Andrej Karpathy's talk from Build 2023 on the state of GPT](https://www.youtube.com/watch?v=bZQun8Y4L2A).
 
   - question: |
-      I asked the model when it's knowledge cutoff is and it gave me a different answer than what is on the Azure OpenAI model's page. Why does this happen?
+      I asked the model when its knowledge cutoff is and it gave me a different answer than what is on the Azure OpenAI model's page. Why does this happen?
     answer:
       This is expected behavior. The models aren't able to answer questions about themselves. If you want to know when the knowledge cutoff for the model's training data is, consult the [models page](./concepts/models.md).
   - question: |
       I asked the model a question about something that happened recently before the knowledge cutoff and it got the answer wrong. Why does this happen?
     answer: |
-      This is expected behavior. First there's no guarantee that every recent event that has occurred was part of the model's training data. And even when information was part of the training data, without using additional techniques like Retrieval Augmented Generation (RAG) to help ground the model's responses there's always a chance of ungrounded responses occurring. Both Azure OpenAI's [use your data feature](./concepts/use-your-data.md) and [Bing Chat](https://www.microsoft.com/edge/features/bing-chat?form=MT00D8) use Azure OpenAI models combined with Retrieval Augmented Generation to help further ground model responses.
+      This is expected behavior. First there's no guarantee that every recent event was part of the model's training data. And even when information was part of the training data, without using additional techniques like Retrieval Augmented Generation (RAG) to help ground the model's responses there's always a chance of ungrounded responses occurring. Both Azure OpenAI's [use your data feature](./concepts/use-your-data.md) and [Bing Chat](https://www.microsoft.com/edge/features/bing-chat?form=MT00D8) use Azure OpenAI models combined with Retrieval Augmented Generation to help further ground model responses.
 
       The frequency that a given piece of information appeared in the training data can also impact the likelihood that the model will respond in a certain way.
 
@@ -215,7 +215,7 @@ sections:
   - question: |
       What are the known limitations of GPT-4 Turbo with Vision?
     answer: |
-      See the [limitations](./how-to/gpt-with-vision.md#limitations) section of the GPT-4 Turbo with Vision how-to guide.
+      See the [limitations](./concepts/gpt-with-vision.md#limitations) section of the GPT-4 Turbo with Vision concepts guide.
 
   - name: Web app
     questions:

articles/ai-services/openai/how-to/gpt-with-vision.md

Lines changed: 3 additions & 36 deletions
@@ -175,7 +175,7 @@ Send a POST request to `https://{RESOURCE_NAME}.openai.azure.com/openai/deployme
 
 The format is similar to that of the chat completions API for GPT-4, but the message content can be an array containing strings and images (either a valid HTTP or HTTPS URL to an image, or a base-64-encoded image).
 
-You must also include the `enhancements` and `dataSources` objects. `enhancements` represents the specific Vision enhancement features requested in the chat. It has a `grounding` and `ocr` property, which each have a boolean `enabled` property. Use these to request the OCR service and/or the object detection/grounding service. `dataSources` represents the Computer Vision resource data that's needed for Vision enhancement. It has a `type` property which should be `"AzureComputerVision"` and a `parameters` property. Set the `endpoint` and `key` to the endpoint URL and access key of your Computer Vision resource. Remember to set a `"max_tokens"` value, or the return output will be cut off.
+You must also include the `enhancements` and `dataSources` objects. `enhancements` represents the specific Vision enhancement features requested in the chat. It has a `grounding` and `ocr` property, which both have a boolean `enabled` property. Use these to request the OCR service and/or the object detection/grounding service. `dataSources` represents the Computer Vision resource data that's needed for Vision enhancement. It has a `type` property which should be `"AzureComputerVision"` and a `parameters` property. Set the `endpoint` and `key` to the endpoint URL and access key of your Computer Vision resource. Remember to set a `"max_tokens"` value, or the return output will be cut off.
 
 ```json
 {
@@ -287,7 +287,7 @@ GPT-4 Turbo with Vision provides exclusive access to Azure AI Services tailored
 > To use Vision enhancement, you need a Computer Vision resource, and it must be in the same Azure region as your GPT-4 Turbo with Vision resource.
 
 > [!CAUTION]
-> Azure AI enhancements for GPT-4 Turbo with Vision will be billed separately from the core functionalities. Each specific Azure AI enhancement for GPT-4 Turbo with Vision has its own distinct charges.
+> Azure AI enhancements for GPT-4 Turbo with Vision will be billed separately from the core functionalities. Each specific Azure AI enhancement for GPT-4 Turbo with Vision has its own distinct charges. For details, see the [special pricing information](../concepts/gpt-with-vision.md#special-pricing-information).
 
 Follow these steps to set up a video retrieval system and integrate it with your AI chat model:
 1. Get an Azure AI Vision resource in the same region as the Azure OpenAI resource you're using.
@@ -400,40 +400,7 @@ Base Pricing for GPT-4 Turbo with Vision is:
 
 Video prompt integration with Video Retrieval Add-on:
 - Ingestion: $0.05 per minute of video
-- Transactions: $0.25 per 1000 queries of the Video Retrieval index
-
-Processing videos will involve the use of extra tokens to identify key frames for analysis. The number of these additional tokens will be roughly equivalent to the sum of the tokens in the text input plus 700 tokens.
-
-#### Calculation
-For a typical use case let's imagine that I have use a 3-minute video with a 100-token prompt input. The section of video has a transcript that's 100-tokens long and when I process the prompt, I generate 100-tokens of output. The pricing for this transaction would be as follows:
-
-| Item | Detail | Total Cost |
-|-------------------------------------------|---------------------------------------------------------------|--------------|
-| GPT-4 Turbo with Vision Input Tokens | 100 text tokens | $0.001 |
-| Additional Cost to identify frames | 100 input tokens + 700 tokens + 1 Video Retrieval txn | $0.00825 |
-| Image Inputs and Transcript Input | 20 images (85 tokens each) + 100 transcript tokens | $0.018 |
-| Output Tokens | 100 tokens (assumed) | $0.003 |
-| **Total Cost** | | **$0.03025** |
-
-Additionally, there's a one-time indexing cost of $0.15 to generate the Video Retrieval index for this 3-minute segment of video. This index can be reused across any number of Video Retrieval and GPT-4 Turbo with Vision calls.
-
-## Limitations
-
-### Image support
-
-- **Limitation on image enhancements per chat session**: Enhancements cannot be applied to multiple images within a single chat call.
-- **Maximum input image size**: The maximum size for input images is restricted to 20 MB.
-- **Object grounding in enhancement API**: When the enhancement API is used for object grounding, and the model detects duplicates of an object, it will generate one bounding box and label for all the duplicates instead of separate ones for each.
-- **Low resolution accuracy**: When images are analyzed using the "low resolution" setting, it allows for faster responses and uses fewer input tokens for certain use cases. However, this could impact the accuracy of object and text recognition within the image.
-- **Image chat restriction**: When uploading images in the chat playground or the API, there is a limit of 10 images per chat call.
-
-### Video support
-
-- **Low resolution**: Video frames are analyzed using GPT-4 Turbo with Vision's "low resolution" setting, which may affect the accuracy of small object and text recognition in the video.
-- **Video file limits**: Both MP4 and MOV file types are supported. In the Azure AI Playground, videos must be less than 3 minutes long. When you use the API there is no such limitation.
-- **Prompt limits**: Video prompts only contain one video and no images. In Playground, you can clear the session to try another video or images.
-- **Limited frame selection**: The service selects 20 frames from the entire video, which might not capture all the critical moments or details. Frame selection can be approximately evenly spread through the video or focused by a specific video retrieval query, depending on the prompt.
-- **Language support**: The service primarily supports English for grounding with transcripts. Transcripts don't provide accurate information on lyrics in songs.
+- Transactions: $0.25 per 1000 queries of the Video Retrieval indexer
 
 ## Next steps
 
articles/ai-services/openai/how-to/switching-endpoints.md

Lines changed: 8 additions & 5 deletions
@@ -34,7 +34,7 @@ We recommend using environment variables. If you haven't done this before our [P
 from openai import OpenAI
 
 client = OpenAI(
-    api_key=os.environ['OPENAI_API_KEY']
+    api_key=os.environ["OPENAI_API_KEY"]
 )
 
 
@@ -74,7 +74,7 @@ client = AzureOpenAI(
 from openai import OpenAI
 
 client = OpenAI(
-    api_key=os.environ['OPENAI_API_KEY']
+    api_key=os.environ["OPENAI_API_KEY"]
 )
 
 
@@ -111,7 +111,10 @@ client = AzureOpenAI(
 
 ## Keyword argument for model
 
-OpenAI uses the `model` keyword argument to specify what model to use. Azure OpenAI has the concept of unique model [deployments](create-resource.md?pivots=web-portal#deploy-a-model). When using Azure OpenAI `model` should refer to the underling deployment name you chose when you deployed the model.
+OpenAI uses the `model` keyword argument to specify what model to use. Azure OpenAI has the concept of unique model [deployments](create-resource.md?pivots=web-portal#deploy-a-model). When using Azure OpenAI, `model` should refer to the underlying deployment name you chose when you deployed the model.
+
+> [!IMPORTANT]
+> When you access the model via the API in Azure OpenAI, you need to refer to the deployment name rather than the underlying model name in API calls. This is one of the [key differences](../how-to/switching-endpoints.md) between OpenAI and Azure OpenAI: OpenAI requires only the model name, while Azure OpenAI always requires the deployment name, even when you use the `model` parameter. In our docs, we often have examples where deployment names are represented as identical to model names to help indicate which model works with a particular API endpoint. Ultimately, your deployment names can follow whatever naming convention is best for your use case.
 
 <table>
 <tr>

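To illustrate the note added above, here's a minimal sketch of a chat completions call against Azure OpenAI. The endpoint and key environment variables, API version, and deployment name are placeholders rather than values from this article:

```python
# Minimal sketch: with Azure OpenAI, `model` is the deployment name you chose,
# not the underlying model name. Endpoint, key, API version, and deployment
# name below are placeholders.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-12-01-preview",
)

response = client.chat.completions.create(
    model="my-gpt-35-turbo-deployment",  # deployment name, not model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
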
@@ -122,7 +125,7 @@ OpenAI uses the `model` keyword argument to specify what model to use. Azure Ope
 
 ```python
 completion = client.completions.create(
-    model='gpt-3.5-turbo-instruct',
+    model="gpt-3.5-turbo-instruct",
     prompt="<prompt>"
 )
 
@@ -142,7 +145,7 @@ embedding = client.embeddings.create(
 
 ```python
 completion = client.completions.create(
-    model=gpt-35-turbo-instruct, # This must match the custom deployment name you chose for your model.
+    model="gpt-35-turbo-instruct", # This must match the custom deployment name you chose for your model.
     prompt="<prompt>"
 )
 