Commit 6f6d234

rm content from original doc
1 parent 171e47b commit 6f6d234

1 file changed: +1 -34 lines changed

articles/ai-services/openai/how-to/gpt-with-vision.md

Lines changed: 1 addition & 34 deletions
@@ -400,40 +400,7 @@ Base Pricing for GPT-4 Turbo with Vision is:
 
 Video prompt integration with Video Retrieval Add-on:
 - Ingestion: $0.05 per minute of video
-- Transactions: $0.25 per 1000 queries of the Video Retrieval index
-
-Processing videos will involve the use of extra tokens to identify key frames for analysis. The number of these additional tokens will be roughly equivalent to the sum of the tokens in the text input plus 700 tokens.
-
-#### Calculation
-For a typical use case let's imagine that I have use a 3-minute video with a 100-token prompt input. The section of video has a transcript that's 100-tokens long and when I process the prompt, I generate 100-tokens of output. The pricing for this transaction would be as follows:
-
-| Item | Detail | Total Cost |
-|-------------------------------------------|---------------------------------------------------------------|--------------|
-| GPT-4 Turbo with Vision Input Tokens | 100 text tokens | $0.001 |
-| Additional Cost to identify frames | 100 input tokens + 700 tokens + 1 Video Retrieval txn | $0.00825 |
-| Image Inputs and Transcript Input | 20 images (85 tokens each) + 100 transcript tokens | $0.018 |
-| Output Tokens | 100 tokens (assumed) | $0.003 |
-| **Total Cost** | | **$0.03025** |
-
-Additionally, there's a one-time indexing cost of $0.15 to generate the Video Retrieval index for this 3-minute segment of video. This index can be reused across any number of Video Retrieval and GPT-4 Turbo with Vision calls.
-
-## Limitations
-
-### Image support
-
-- **Limitation on image enhancements per chat session**: Enhancements cannot be applied to multiple images within a single chat call.
-- **Maximum input image size**: The maximum size for input images is restricted to 20 MB.
-- **Object grounding in enhancement API**: When the enhancement API is used for object grounding, and the model detects duplicates of an object, it will generate one bounding box and label for all the duplicates instead of separate ones for each.
-- **Low resolution accuracy**: When images are analyzed using the "low resolution" setting, it allows for faster responses and uses fewer input tokens for certain use cases. However, this could impact the accuracy of object and text recognition within the image.
-- **Image chat restriction**: When uploading images in the chat playground or the API, there is a limit of 10 images per chat call.
-
-### Video support
-
-- **Low resolution**: Video frames are analyzed using GPT-4 Turbo with Vision's "low resolution" setting, which may affect the accuracy of small object and text recognition in the video.
-- **Video file limits**: Both MP4 and MOV file types are supported. In the Azure AI Playground, videos must be less than 3 minutes long. When you use the API there is no such limitation.
-- **Prompt limits**: Video prompts only contain one video and no images. In Playground, you can clear the session to try another video or images.
-- **Limited frame selection**: The service selects 20 frames from the entire video, which might not capture all the critical moments or details. Frame selection can be approximately evenly spread through the video or focused by a specific video retrieval query, depending on the prompt.
-- **Language support**: The service primarily supports English for grounding with transcripts. Transcripts don't provide accurate information on lyrics in songs.
+- Transactions: $0.25 per 1000 queries of the Video Retrieval indexer
 
 ## Next steps
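
Reviewer note: the arithmetic in the removed "Calculation" table can be re-derived with a short script. This is only a sketch. The per-token rates ($0.01 per 1000 input tokens, $0.03 per 1000 output tokens) are not stated in this hunk; they are inferred from the table rows, so treat them as assumptions. The function and parameter names are hypothetical and not part of any Azure SDK.

```python
from decimal import Decimal

# Assumed rates, inferred from the table rows in the removed section.
INPUT_PER_TOKEN = Decimal("0.01") / 1000    # assumption: $0.01 per 1000 input tokens
OUTPUT_PER_TOKEN = Decimal("0.03") / 1000   # assumption: $0.03 per 1000 output tokens
# Rates quoted in the diff itself.
RETRIEVAL_PER_QUERY = Decimal("0.25") / 1000  # $0.25 per 1000 Video Retrieval queries
INGESTION_PER_MINUTE = Decimal("0.05")        # $0.05 per minute of video


def video_prompt_cost(prompt_tokens, transcript_tokens, output_tokens,
                      frames=20, tokens_per_frame=85,
                      frame_overhead_tokens=700, retrieval_queries=1):
    """Break down the cost of one video prompt, row by row as in the table."""
    text_input = prompt_tokens * INPUT_PER_TOKEN
    # Extra tokens to identify key frames: prompt tokens + 700, plus the retrieval query.
    frame_selection = ((prompt_tokens + frame_overhead_tokens) * INPUT_PER_TOKEN
                       + retrieval_queries * RETRIEVAL_PER_QUERY)
    # Selected frames (low resolution, 85 tokens each) plus the transcript.
    images_and_transcript = ((frames * tokens_per_frame + transcript_tokens)
                             * INPUT_PER_TOKEN)
    output = output_tokens * OUTPUT_PER_TOKEN
    total = text_input + frame_selection + images_and_transcript + output
    return {
        "text_input": text_input,
        "frame_selection": frame_selection,
        "images_and_transcript": images_and_transcript,
        "output": output,
        "total": total,
    }


def indexing_cost(video_minutes):
    """One-time cost of building the Video Retrieval index."""
    return video_minutes * INGESTION_PER_MINUTE


costs = video_prompt_cost(prompt_tokens=100, transcript_tokens=100, output_tokens=100)
one_time_index = indexing_cost(3)
```

Under these assumed rates, the script gives the same figures as the removed table for the 3-minute example: each row, the per-call total, and the one-time indexing cost. `Decimal` is used so the small dollar amounts compare exactly.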
