Commit 750e417

add concepts article
1 parent 74a2ceb commit 750e417

7 files changed: +97 -45 lines changed

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
---
title: GPT-4 Turbo with Vision concepts
titleSuffix: Azure OpenAI
description: Learn about vision chats enabled by GPT-4 Turbo with Vision.
author: PatrickFarley
ms.author: pafarley
ms.service: azure-ai-openai
ms.topic: conceptual
ms.date: 01/02/2024
manager: nitinme
keywords:
---

# GPT-4 Turbo with Vision concepts

GPT-4 Turbo with Vision is a large multimodal model (LMM) developed by OpenAI that can analyze images and provide textual responses to questions about them. It incorporates both natural language processing and visual understanding. This guide provides details on the capabilities and limitations of GPT-4 Turbo with Vision.

To try out GPT-4 Turbo with Vision, see the [quickstart](/azure/ai-services/openai/gpt-v-quickstart).

## Chats with vision

The GPT-4 Turbo with Vision model answers general questions about what's present in the images or videos you upload.
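
A chat-with-vision call uses the standard chat completions message format, with the user message carrying both text and image content parts. The sketch below builds such a request body; the image URL is a placeholder, and the exact endpoint, deployment name, and authentication are covered in the how-to guide, not here.

```python
import json

# Hedged sketch of a chat-with-vision request body. The image URL is a
# placeholder; endpoint, deployment, and auth details are out of scope here.
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            # The user turn mixes a text part and an image part.
            "content": [
                {"type": "text", "text": "Describe this picture:"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        },
    ],
    "max_tokens": 300,
}
print(json.dumps(payload, indent=2))
```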

## Enhancements

Enhancements let you incorporate other Azure AI services to add new functionality to the chat-with-vision experience.

**Object grounding**: Azure AI Vision complements GPT-4 Turbo with Vision's text response by identifying and locating salient objects in the input images. This lets the chat model give more accurate and detailed responses about the contents of the image.

:::image type="content" source="../media/concepts/gpt-v/object-grounding.png" alt-text="Screenshot of an image with object grounding applied. Objects have bounding boxes with labels.":::

:::image type="content" source="../media/concepts/gpt-v/object-grounding-response.png" alt-text="Screenshot of a chat response to an image prompt about an outfit. The response is an itemized list of clothing items seen in the image.":::

**Optical Character Recognition (OCR)**: Azure AI Vision complements GPT-4 Turbo with Vision by providing high-quality OCR results as supplementary information to the chat model. This allows the model to produce higher quality responses for images with dense text, transformed images, and numbers-heavy financial documents, and increases the variety of languages the model can recognize in text.

:::image type="content" source="../media/concepts/gpt-v/receipts.png" alt-text="Photo of several receipts.":::

:::image type="content" source="../media/concepts/gpt-v/ocr-response.png" alt-text="Screenshot of the JSON response of an OCR call.":::
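
In the preview API, enhancements are requested through extra fields on the chat request body that point the service at an Azure AI Vision resource. The sketch below shows the general shape; treat the exact field names and endpoint/key values as assumptions taken from the preview-era how-to guide, which remains the authoritative reference.

```python
# Hedged sketch of a chat request body with enhancements enabled.
# Field names follow the 2023-12 preview as described in the how-to guide;
# the Vision endpoint and key are placeholders you must supply.
body = {
    "enhancements": {
        "ocr": {"enabled": True},        # supplementary OCR from Azure AI Vision
        "grounding": {"enabled": True},  # object grounding with bounding boxes
    },
    "dataSources": [
        {
            "type": "AzureComputerVision",
            "parameters": {
                "endpoint": "https://<your-vision-resource>.cognitiveservices.azure.com/",
                "key": "<your-vision-key>",
            },
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What items are in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/outfit.jpg"}},
            ],
        }
    ],
    "max_tokens": 300,
}
```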

## Special pricing information

GPT-4 Turbo with Vision accrues charges like other Azure OpenAI chat models: you pay a per-token rate for prompts and completions, detailed on the [Pricing page](/pricing/details/cognitive-services/openai-service/). The base charges and additional features are outlined here.

Base pricing for GPT-4 Turbo with Vision is:
- Input: $0.01 per 1,000 tokens
- Output: $0.03 per 1,000 tokens

See the [Tokens section of the overview](/azure/ai-services/openai/overview#tokens) for information on how text and images translate to tokens.
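
As a rough illustration of image token accounting: an image analyzed at "low resolution" costs a flat 85 input tokens, the same per-frame figure the example price calculation below relies on. The high-detail tiling formula in this sketch (85 base tokens plus 170 per 512-px tile after scaling) comes from OpenAI's published guidance and should be treated as an assumption here, not something this article states.

```python
import math

# Sketch of image token accounting. "low" detail is a flat 85 tokens
# (matches this article's per-frame figure). The "high" branch is an
# assumption based on OpenAI's published tiling rules.
def image_tokens(width, height, detail="low"):
    if detail == "low":
        return 85
    # Scale the long side down to <= 2048 px, then the short side to <= 768 px.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale2 = min(1.0, 768 / min(w, h))
    w, h = w * scale2, h * scale2
    # 170 tokens per 512-px tile, plus 85 base tokens.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(image_tokens(1024, 1024, detail="low"))   # → 85
```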

Additionally, if you use video prompt integration with the Video Retrieval add-on, it accrues other costs:
- Ingestion: $0.05 per minute of video
- Transactions: $0.25 per 1,000 queries of the Video Retrieval index

Processing videos involves extra tokens to identify key frames for analysis. The number of these additional tokens is roughly equivalent to the sum of the tokens in the text input, plus 700 tokens.

### Example price calculation

For a typical use case, take a 3-minute video with a 100-token prompt input. The video has a transcript that's 100 tokens long, and when the service processes the prompt, it generates 100 tokens of output. The pricing for this transaction would be:

| Item | Detail | Total cost |
|------|--------|------------|
| GPT-4 Turbo with Vision input tokens | 100 text tokens | $0.001 |
| Additional cost to identify frames | 100 input tokens + 700 tokens + 1 Video Retrieval transaction | $0.00825 |
| Image inputs and transcript input | 20 images (85 tokens each) + 100 transcript tokens | $0.018 |
| Output tokens | 100 tokens (assumed) | $0.003 |
| **Total cost** | | **$0.03025** |

Additionally, there's a one-time indexing cost of $0.15 to generate the Video Retrieval index for this 3-minute video. This index can be reused across any number of Video Retrieval and GPT-4 Turbo with Vision calls.
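
The arithmetic in the table above can be reproduced with a short script. The rates are the ones stated in this article; the function and variable names are my own, not part of any SDK.

```python
# Reproduces the example price calculation using the rates in this article.
def video_chat_cost(prompt_tokens, transcript_tokens, output_tokens,
                    frames=20, tokens_per_frame=85):
    input_rate = 0.01 / 1000      # $ per input token
    output_rate = 0.03 / 1000     # $ per output token
    retrieval_txn = 0.25 / 1000   # $ per Video Retrieval query

    prompt_cost = prompt_tokens * input_rate
    # Frame identification: prompt tokens + ~700 extra tokens, plus one
    # Video Retrieval transaction.
    frame_id_cost = (prompt_tokens + 700) * input_rate + retrieval_txn
    image_cost = (frames * tokens_per_frame + transcript_tokens) * input_rate
    output_cost = output_tokens * output_rate
    return prompt_cost + frame_id_cost + image_cost + output_cost

print(f"${video_chat_cost(100, 100, 100):.5f}")  # → $0.03025
```

The one-time $0.15 indexing cost for the 3-minute video ($0.05 per minute of ingestion) is not included, since it's paid once per video rather than per call.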

## Limitations

This section describes the limitations of GPT-4 Turbo with Vision.

### Image support

- **Limitation on image enhancements per chat session**: Enhancements can't be applied to multiple images within a single chat call.
- **Maximum input image size**: The maximum size for input images is restricted to 20 MB.
- **Object grounding in enhancement API**: When the enhancement API is used for object grounding and the model detects duplicates of an object, it generates one bounding box and label for all the duplicates instead of separate ones for each.
- **Low resolution accuracy**: Analyzing images with the "low resolution" setting allows for faster responses and uses fewer input tokens for certain use cases, but it could impact the accuracy of object and text recognition within the image.
- **Image chat restriction**: When you upload images in the chat playground or the API, there is a limit of 10 images per chat call.
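
The size and count limits above are easy to check client-side before sending a request. A minimal pre-flight helper (my own sketch, not part of any SDK) might look like:

```python
# Pre-flight check mirroring the image limits listed above:
# at most 10 images per chat call, each at most 20 MB.
MAX_IMAGES_PER_CALL = 10
MAX_IMAGE_BYTES = 20 * 1024 * 1024

def check_image_inputs(image_sizes_bytes):
    """Raise ValueError if the inputs exceed the documented limits."""
    if len(image_sizes_bytes) > MAX_IMAGES_PER_CALL:
        raise ValueError(
            f"Too many images: {len(image_sizes_bytes)} > {MAX_IMAGES_PER_CALL}")
    for i, size in enumerate(image_sizes_bytes):
        if size > MAX_IMAGE_BYTES:
            raise ValueError(
                f"Image {i} is {size} bytes; limit is {MAX_IMAGE_BYTES}")
    return True

check_image_inputs([5 * 1024 * 1024] * 10)  # 10 images of 5 MB each: OK
```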

### Video support

- **Low resolution**: Video frames are analyzed using GPT-4 Turbo with Vision's "low resolution" setting, which may affect the accuracy of small object and text recognition in the video.
- **Video file limits**: Both MP4 and MOV file types are supported. In the Azure AI Playground, videos must be less than 3 minutes long. When you use the API, there is no such limitation.
- **Prompt limits**: Video prompts contain only one video and no images. In the playground, you can clear the session to try another video or images.
- **Limited frame selection**: The service selects 20 frames from the entire video, which might not capture all the critical moments or details. Frame selection can be approximately evenly spread through the video or focused by a specific Video Retrieval query, depending on the prompt.
- **Language support**: The service primarily supports English for grounding with transcripts. Transcripts don't provide accurate information on lyrics in songs.
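
To build intuition for the "limited frame selection" bullet: evenly spread sampling of 20 frames from a 3-minute video means roughly one frame every 9 seconds, so brief events can fall between samples. This sketch illustrates the sampling idea only; it is not the service's actual selection code.

```python
# Illustration of evenly spread frame selection: one timestamp at the
# midpoint of each of num_frames equal segments of the video.
def evenly_spaced_timestamps(duration_seconds, num_frames=20):
    step = duration_seconds / num_frames
    return [round((i + 0.5) * step, 2) for i in range(num_frames)]

# A 3-minute (180 s) video sampled every 9 seconds: anything shorter than
# that gap can be missed entirely.
print(evenly_spaced_timestamps(180)[:3])  # → [4.5, 13.5, 22.5]
```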

## Next steps

- Get started using GPT-4 Turbo with Vision by following the [quickstart](/azure/ai-services/openai/gpt-v-quickstart).
- For a more in-depth look at the APIs, and to use video prompts in chat, follow the [how-to guide](../how-to/gpt-with-vision.md).
- See the [completions and embeddings API reference](../reference.md).

articles/ai-services/openai/how-to/gpt-with-vision.md

Lines changed: 0 additions & 45 deletions
@@ -406,51 +406,6 @@ Every response includes a `"finish_details"` field. The subfield `"type"` has th

 If `finish_details.type` is `stop`, then there is another `"stop"` property that specifies the token that caused the output to end.

-### Pricing example for Video prompts
-The pricing for GPT-4 Turbo with Vision is dynamic and depends on the specific features and inputs used. For a comprehensive view of Azure OpenAI pricing see [Azure OpenAI Pricing](https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/).
-
-The base charges and additional features are outlined below:
-
-Base Pricing for GPT-4 Turbo with Vision is:
-- Input: $0.01 per 1000 tokens
-- Output: $0.03 per 1000 tokens
-
-Video prompt integration with Video Retrieval Add-on:
-- Ingestion: $0.05 per minute of video
-- Transactions: $0.25 per 1000 queries of the Video Retrieval index
-
-Processing videos will involve the use of extra tokens to identify key frames for analysis. The number of these additional tokens will be roughly equivalent to the sum of the tokens in the text input plus 700 tokens.
-
-#### Calculation
-For a typical use case, let's imagine that I use a 3-minute video with a 100-token prompt input. The section of video has a transcript that's 100 tokens long, and when I process the prompt, I generate 100 tokens of output. The pricing for this transaction would be as follows:
-
-| Item | Detail | Total Cost |
-|------|--------|------------|
-| GPT-4 Turbo with Vision Input Tokens | 100 text tokens | $0.001 |
-| Additional Cost to identify frames | 100 input tokens + 700 tokens + 1 Video Retrieval txn | $0.00825 |
-| Image Inputs and Transcript Input | 20 images (85 tokens each) + 100 transcript tokens | $0.018 |
-| Output Tokens | 100 tokens (assumed) | $0.003 |
-| **Total Cost** | | **$0.03025** |
-
-Additionally, there's a one-time indexing cost of $0.15 to generate the Video Retrieval index for this 3-minute segment of video. This index can be reused across any number of Video Retrieval and GPT-4 Turbo with Vision calls.
-
-## Limitations
-
-### Image support
-
-- **Limitation on image enhancements per chat session**: Enhancements cannot be applied to multiple images within a single chat call.
-- **Maximum input image size**: The maximum size for input images is restricted to 20 MB.
-- **Object grounding in enhancement API**: When the enhancement API is used for object grounding, and the model detects duplicates of an object, it will generate one bounding box and label for all the duplicates instead of separate ones for each.
-- **Low resolution accuracy**: When images are analyzed using the "low resolution" setting, it allows for faster responses and uses fewer input tokens for certain use cases. However, this could impact the accuracy of object and text recognition within the image.
-- **Image chat restriction**: When uploading images in the chat playground or the API, there is a limit of 10 images per chat call.
-
-### Video support
-
-- **Low resolution**: Video frames are analyzed using GPT-4 Turbo with Vision's "low resolution" setting, which may affect the accuracy of small object and text recognition in the video.
-- **Video file limits**: Both MP4 and MOV file types are supported. In the Azure AI Playground, videos must be less than 3 minutes long. When you use the API there is no such limitation.
-- **Prompt limits**: Video prompts only contain one video and no images. In Playground, you can clear the session to try another video or images.
-- **Limited frame selection**: The service selects 20 frames from the entire video, which might not capture all the critical moments or details. Frame selection can be approximately evenly spread through the video or focused by a specific video retrieval query, depending on the prompt.
-- **Language support**: The service primarily supports English for grounding with transcripts. Transcripts don't provide accurate information on lyrics in songs.

 ## Next steps

articles/ai-services/openai/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -45,6 +45,8 @@ items:
     href: ./concepts/understand-embeddings.md
   - name: Fine-tuning
     href: ./concepts/fine-tuning-considerations.md
+  - name: GPT-4 Turbo with Vision
+    href: ./concepts/gpt-with-vision.md
   - name: Red teaming large language models (LLMs)
     href: ./concepts/red-teaming.md
   - name: Content credentials
