articles/ai-services/openai/how-to/gpt-with-vision.md
1 addition & 34 deletions
@@ -400,40 +400,7 @@ Base Pricing for GPT-4 Turbo with Vision is:
 Video prompt integration with Video Retrieval Add-on:
 - Ingestion: $0.05 per minute of video
-- Transactions: $0.25 per 1000 queries of the Video Retrieval index
-
-Processing videos will involve the use of extra tokens to identify key frames for analysis. The number of these additional tokens will be roughly equivalent to the sum of the tokens in the text input plus 700 tokens.
-
-#### Calculation
-For a typical use case, let's imagine that I use a 3-minute video with a 100-token prompt input. The section of video has a transcript that's 100 tokens long, and when I process the prompt, I generate 100 tokens of output. The pricing for this transaction would be as follows:
-Additionally, there's a one-time indexing cost of $0.15 to generate the Video Retrieval index for this 3-minute segment of video. This index can be reused across any number of Video Retrieval and GPT-4 Turbo with Vision calls.
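The Video Retrieval arithmetic in the pricing text above can be sketched in Python. This is only an illustration under stated assumptions, not an official calculator: the function and constant names are hypothetical, the rates are the ones quoted above ($0.05 per minute of ingestion, $0.25 per 1000 queries), and the token estimate applies the "text input plus 700 tokens" rule of thumb. Base GPT-4 Turbo with Vision token prices are left out because they do not appear in this excerpt.

```python
from decimal import Decimal

# Rates quoted in the pricing text above (Video Retrieval Add-on).
INGESTION_RATE_PER_MINUTE = Decimal("0.05")   # USD per minute of video
QUERY_RATE_PER_1000 = Decimal("0.25")         # USD per 1000 index queries
EXTRA_TOKEN_OVERHEAD = 700                    # approximate fixed overhead

def indexing_cost(video_minutes):
    """One-time cost (USD) to build the Video Retrieval index (hypothetical helper)."""
    return INGESTION_RATE_PER_MINUTE * Decimal(str(video_minutes))

def query_cost(num_queries):
    """Cost (USD) of a number of queries against the index (hypothetical helper)."""
    return QUERY_RATE_PER_1000 * Decimal(num_queries) / Decimal(1000)

def extra_video_tokens(text_input_tokens):
    """Rough estimate of the extra tokens used to identify key frames."""
    return text_input_tokens + EXTRA_TOKEN_OVERHEAD

if __name__ == "__main__":
    # The 3-minute video and 100-token prompt from the example above.
    print("Indexing cost (USD):", indexing_cost(3))
    print("Cost of one query (USD):", query_cost(1))
    print("Estimated extra tokens:", extra_video_tokens(100))
```

Using `Decimal` rather than floats keeps the currency arithmetic exact, so the 3-minute case comes out to the $0.15 indexing cost stated above.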
-
-## Limitations
-
-### Image support
-
-- **Limitation on image enhancements per chat session**: Enhancements cannot be applied to multiple images within a single chat call.
-- **Maximum input image size**: The maximum size for input images is restricted to 20 MB.
-- **Object grounding in enhancement API**: When the enhancement API is used for object grounding, and the model detects duplicates of an object, it will generate one bounding box and label for all the duplicates instead of separate ones for each.
-- **Low resolution accuracy**: When images are analyzed using the "low resolution" setting, it allows for faster responses and uses fewer input tokens for certain use cases. However, this could impact the accuracy of object and text recognition within the image.
-- **Image chat restriction**: When uploading images in the chat playground or the API, there is a limit of 10 images per chat call.
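The two numeric image limits above (20 MB per image, 10 images per chat call) lend themselves to a client-side check before a request is sent. The sketch below is a hypothetical helper, not part of any Azure SDK; in particular, treating "20 MB" as 20 × 1024 × 1024 bytes is an assumption, since the text does not say whether the limit is decimal or binary megabytes.

```python
# Limits taken from the list above; the byte interpretation of "20 MB" is an assumption.
MAX_IMAGE_BYTES = 20 * 1024 * 1024
MAX_IMAGES_PER_CALL = 10

def check_image_batch(image_sizes):
    """Return a list of human-readable problems for one chat call.

    image_sizes: list of image sizes in bytes (hypothetical input format).
    An empty result means the batch is within both documented limits.
    """
    problems = []
    if len(image_sizes) > MAX_IMAGES_PER_CALL:
        problems.append(
            f"{len(image_sizes)} images exceeds the limit of "
            f"{MAX_IMAGES_PER_CALL} per chat call"
        )
    for index, size in enumerate(image_sizes):
        if size > MAX_IMAGE_BYTES:
            problems.append(
                f"image {index} is {size} bytes, above the "
                f"{MAX_IMAGE_BYTES}-byte limit"
            )
    return problems

if __name__ == "__main__":
    # Eleven small images: only the count limit is violated.
    for problem in check_image_batch([500_000] * 11):
        print(problem)
```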
-
-### Video support
-
-- **Low resolution**: Video frames are analyzed using GPT-4 Turbo with Vision's "low resolution" setting, which may affect the accuracy of small object and text recognition in the video.
-- **Video file limits**: Both MP4 and MOV file types are supported. In the Azure AI Playground, videos must be less than 3 minutes long. When you use the API there is no such limitation.
-- **Prompt limits**: Video prompts only contain one video and no images. In Playground, you can clear the session to try another video or images.
-- **Limited frame selection**: The service selects 20 frames from the entire video, which might not capture all the critical moments or details. Frame selection can be approximately evenly spread through the video or focused by a specific video retrieval query, depending on the prompt.
-- **Language support**: The service primarily supports English for grounding with transcripts. Transcripts don't provide accurate information on lyrics in songs.
+- Transactions: $0.25 per 1000 queries of the Video Retrieval indexer
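To get a feel for what "20 frames, approximately evenly spread" means for coverage, the following sketch computes evenly spaced sample timestamps for a video of a given length. It is an illustration only: the service's actual selection logic is not described here, and taking the midpoint of each of 20 equal segments is an assumption made for the example. The function name is hypothetical.

```python
def evenly_spaced_timestamps(duration_seconds, num_frames=20):
    """Return num_frames timestamps (seconds), one at the midpoint of each
    equal-length segment of the video. Illustrative assumption only."""
    if duration_seconds <= 0 or num_frames <= 0:
        raise ValueError("duration_seconds and num_frames must be positive")
    segment = duration_seconds / num_frames
    return [segment * (i + 0.5) for i in range(num_frames)]

if __name__ == "__main__":
    # A 3-minute (180-second) video sampled at 20 frames.
    timestamps = evenly_spaced_timestamps(180)
    print("Spacing between frames (s):", timestamps[1] - timestamps[0])
    print("First and last timestamps (s):", timestamps[0], timestamps[-1])
```

Under this assumption a 3-minute video yields one frame every 9 seconds, which is consistent with the note above that brief moments between sampled frames might not be captured.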