feat(transform, chat, gemini, media): Gemini enable video processing #6150
- Consolidate duplicate getMimeType functions into shared utilities
- Remove duplicate MediaThumbnails component, enhance Thumbnails to support video
- Add JSDoc comments to VideoContentBlock interface
- Convert inline styles to Tailwind classes in ChatRow
- Add robust error handling for video processing
- Create centralized media configuration for accepted file types
- Ensure consistent test naming conventions
- Fix ESLint warnings
Hi @VooDisss, I've addressed all the review feedback in a new branch.
You can view the changes at: https://github.com/RooCodeInc/Roo-Code/tree/pr-6150
Since this PR is from your fork, you'll need to either:
Thank you for your contribution!
@hannesrudolph thank you for your edits in your I have ran |
daniel-lxs
left a comment
Hey @VooDisss, Thank you for your contribution! I left some suggestions to make the implementation more robust, let me know what you think or if you have any questions.
src/api/transform/gemini-format.ts
The base64 validation regex /^[A-Za-z0-9+/]*={0,2}$/ might be too permissive - it would accept an empty string as valid base64. This could cause issues when the Gemini API tries to process empty video data.
Could we add a minimum length check? For example:
if (!base64Regex.test(block.source.data.replace(/\s/g, "")) || block.source.data.trim().length < 4) {
throw new Error("Invalid or empty base64 format for video data")
}
Or perhaps use a more robust validation approach, such as attempting to decode a small portion of the base64 string?
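One way to picture a stricter check than the regex alone is sketched below. The helper name `isPlausibleBase64` and the four-character minimum are assumptions for illustration, not code from this PR; the sketch rejects empty input, lengths that are not a multiple of four, and characters outside the base64 alphabet.

```typescript
// Hypothetical helper; the name and the 4-character minimum are illustrative assumptions.
const BASE64_BODY = /^[A-Za-z0-9+/]+={0,2}$/;

function isPlausibleBase64(data: string): boolean {
  // Ignore whitespace such as line breaks inserted by some encoders.
  const compact = data.replace(/\s/g, "");
  // Reject empty input and anything shorter than one 4-character group.
  if (compact.length < 4) return false;
  // Padded base64 always has a length that is a multiple of 4.
  if (compact.length % 4 !== 0) return false;
  return BASE64_BODY.test(compact);
}
```

Unlike the original pattern, this version uses `+` instead of `*`, so an empty string can never pass.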
src/api/transform/gemini-format.ts
The supported video formats are hard-coded here and also in webview-ui/src/utils/media-config.ts. This duplication could lead to maintenance issues if formats need to be added or removed.
I think we can move these to packages/types so they can be imported from @roo-code/types in both the frontend and backend.
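A minimal sketch of what such a shared definition could look like follows. The constant name, the helper names, and the exact MIME list (in particular `video/webm`) are assumptions for illustration; the real list would come from the two files mentioned above.

```typescript
// Hypothetical shared module, e.g. somewhere under packages/types.
// The constant name and the exact MIME list are assumptions for illustration.
export const SUPPORTED_VIDEO_MIME_TYPES = ["video/mp4", "video/quicktime", "video/webm"] as const;

export type SupportedVideoMimeType = (typeof SUPPORTED_VIDEO_MIME_TYPES)[number];

// Backend use: validate a block's MIME type before sending it to the API.
export function isSupportedVideoMimeType(mimeType: string): mimeType is SupportedVideoMimeType {
  return (SUPPORTED_VIDEO_MIME_TYPES as readonly string[]).indexOf(mimeType) !== -1;
}

// Frontend use: build the `accept` attribute for a file input from the same list.
export function videoAcceptAttribute(): string {
  return SUPPORTED_VIDEO_MIME_TYPES.join(",");
}
```

With a single source of truth, adding or removing a format becomes a one-line change that both the webview and the transformer pick up.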
The video thumbnail rendering doesn't have error handling like the image rendering does (lines 89-99). It might be a good idea to add some error handling to it.
The video thumbnail could benefit from better accessibility. Currently, the title only shows the MIME type.
Consider adding:
- An aria-label that describes this as a video file
- The file size if available
- Video duration if that information is accessible
Example:
aria-label={`Video file: ${mimeType || 'Unknown format'}`}
role="img"
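The label could be assembled by a small helper along these lines. This is only a sketch: `buildVideoAriaLabel`, `sizeBytes`, and `durationSeconds` are hypothetical names, and whether the thumbnail component actually has the size or duration available is an assumption.

```typescript
// Hypothetical helper; `sizeBytes` and `durationSeconds` are assumptions
// about what the thumbnail component might have available.
interface VideoLabelInfo {
  mimeType?: string;
  sizeBytes?: number;
  durationSeconds?: number;
}

function buildVideoAriaLabel(info: VideoLabelInfo): string {
  const parts = [`Video file: ${info.mimeType || "Unknown format"}`];
  // Append the size in MB only when it is known.
  if (info.sizeBytes !== undefined) {
    parts.push(`${(info.sizeBytes / (1024 * 1024)).toFixed(1)} MB`);
  }
  // Append the duration, rounded to whole seconds, only when it is known.
  if (info.durationSeconds !== undefined) {
    parts.push(`${Math.round(info.durationSeconds)} seconds`);
  }
  return parts.join(", ");
}
```

The result can be passed straight to the `aria-label` attribute shown in the example above.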
Closing for now
That would be a nice feature. I hope it will be merged into mainline soon. Thank you for your contributions.
Hello, while you are at it, could you add an fps parameter? By default, Gemini samples video at 1 frame per second, which is too coarse when, for example, the video shows a flickering, buggy screen, a common case when videos are used for development. Here is the reference. Another reference to the metadata in the Gemini JS library.
@nnWhisperer, if you want to use this Gemini video support feature and you want it now, I suggest you compile your own .vsix of the extension from
It does not use the Files API, so it is not limited to 1 fps analysis. It converts your video into base64 and sends it together with the prompt (in text format) in the HTTP API request.
Issues Noticed (using it for the past 2 weeks):
Development Status:
Regarding Official Documentation:
Thank you for the compliment.
@nnWhisperer Below is the attached video, which increased the context by 9.2k tokens, just for your reference. You raised a good point about why the Files API should be preferred (although the video would then have to be uploaded every time, instead of staying in context for several messages). I think there should instead be a way to remove particular messages or videos from the chat context (but that would be a separate issue). Also, I just tried uploading a 41-second, 11.4 MB video and it failed.
When it is uploaded, it is uploaded once and a variable pointing to the file is reused; it isn't uploaded on every message, which provides the speed advantage.
Hey @VooDisss, I took another look; there are some points from my previous review that might still need addressing.
Overall I think the idea and implementation are solid. Thank you, and sorry for the delay!
webview-ui/src/utils/media-config.ts
The hardcoded list of Gemini model IDs here is quite long and will require manual updates whenever new models are added. Consider using a pattern matching approach instead:
Current:
const isGeminiWithVideo =
Suggested:
const isGeminiWithVideo = modelId?.includes("gemini-") &&
  (modelId.includes("-pro") || modelId.includes("-flash"));
Or better yet, this information could come from the model configuration itself.
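Both ideas can be combined, as in the sketch below: prefer an explicit capability flag and fall back to the pattern match only when the flag is absent. The `supportsVideo` field and the `modelSupportsVideo` name are assumptions for illustration and are not confirmed parts of the project's model configuration.

```typescript
// Hypothetical shape: `supportsVideo` is an assumed flag, not a confirmed field
// of the project's model configuration.
interface ModelCapabilities {
  supportsVideo?: boolean;
}

// Fallback mirroring the suggestion above: a Gemini ID containing -pro or -flash.
function matchesGeminiVideoPattern(modelId: string): boolean {
  return (
    modelId.indexOf("gemini-") !== -1 &&
    (modelId.indexOf("-pro") !== -1 || modelId.indexOf("-flash") !== -1)
  );
}

function modelSupportsVideo(modelId: string | undefined, info?: ModelCapabilities): boolean {
  // Prefer an explicit capability flag when the configuration provides one.
  const flag = info ? info.supportsVideo : undefined;
  if (flag !== undefined) return flag;
  return modelId !== undefined && matchesGeminiVideoPattern(modelId);
}
```

The fallback keeps today's behavior for Gemini IDs, while the flag lets a model opt in or out without touching the matching code.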
The constant name MAX_IMAGES_PER_MESSAGE is now misleading since it applies to both images and videos. Consider renaming to MAX_MEDIA_PER_MESSAGE to better reflect its purpose.
The warning message still references images when it should be more generic for media:
Current:
console.warn(t("chat:noValidImages"))
Suggested:
console.warn(t("chat:noValidMedia"))
You'll also need to update the corresponding translation key.
src/api/transform/gemini-format.ts
Based on the PR discussion, there's a < 20MB limitation for video files when using the HTTP API. Consider adding validation here to provide better error messages:
Current:
if (!block.source.data || block.source.data.trim() === "") {
Suggested:
// Check if video data exists
if (!block.source.data || block.source.data.trim() === "") {
  throw new Error("Video data is empty or missing")
}
// Rough estimate: base64 is ~1.33x the original size
const estimatedSize = (block.source.data.length * 0.75) / (1024 * 1024); // in MB
if (estimatedSize > 20) {
  throw new Error(`Video size (~${estimatedSize.toFixed(1)}MB) exceeds the 20MB limit for direct API calls. Consider using the Files API for larger videos.`);
}
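If a tighter estimate is wanted, the padding characters can be subtracted from the 0.75 factor, as in the sketch below. The helper names are hypothetical, and treating the 20 MB figure as a limit on decoded bytes (rather than on the whole request) is an assumption.

```typescript
// Hypothetical helper names; the 20 MB threshold is the limit cited in this review.
const MAX_INLINE_VIDEO_BYTES = 20 * 1024 * 1024;

function decodedBase64Bytes(data: string): number {
  const compact = data.replace(/\s/g, "");
  // Each 4 base64 characters encode 3 bytes; trailing "=" padding encodes none.
  const match = /={1,2}$/.exec(compact);
  const padding = match ? match[0].length : 0;
  return Math.max(0, Math.floor((compact.length * 3) / 4) - padding);
}

function assertInlineVideoSize(data: string, maxBytes: number = MAX_INLINE_VIDEO_BYTES): void {
  const bytes = decodedBase64Bytes(data);
  if (bytes === 0) throw new Error("Video data is empty or missing");
  if (bytes > maxBytes) {
    const mb = (bytes / (1024 * 1024)).toFixed(1);
    throw new Error(`Video size (~${mb}MB) exceeds the limit for direct API calls.`);
  }
}
```

The `maxBytes` parameter exists mainly so the threshold can be lowered in tests without building a multi-megabyte string.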
nnWhisperer
left a comment
You can add the video metadata and fps parameter in this pull request; I've marked the place to add it in the review comments here. They are supported even when the video is embedded inline in the message. For example, here is a Python version that uses it:
# Only for videos of size <20Mb
# Imports and client setup added for completeness; assumes the google-genai
# package and an API key available in the environment.
from google import genai
from google.genai import types

client = genai.Client()

video_file_name = "/path/to/your/video.mp4"
with open(video_file_name, 'rb') as f:
    video_bytes = f.read()

response = client.models.generate_content(
    model='models/gemini-2.5-flash',
    contents=types.Content(
        parts=[
            types.Part(
                inline_data=types.Blob(
                    data=video_bytes,
                    mime_type='video/mp4'),
                # Sample 5 frames per second instead of the default
                video_metadata=types.VideoMetadata(fps=5)
            ),
            types.Part(text='Please summarize the video in 3 sentences.')
        ]
    )
)
src/api/transform/gemini-format.ts
Here you can add a metadata parameter, after the inline parameter, something similar to this:
video_metadata=types.VideoMetadata(fps=5), where 5 should be configurable
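On the TypeScript side, a configurable fps could be threaded through roughly as sketched here. The part shape is declared locally and is an assumption: `inlineData` and `videoMetadata` are taken to be the camelCase counterparts of the Python fields above, and `buildInlineVideoPart` is a hypothetical helper, not a function from this PR or from any SDK.

```typescript
// Local types mirroring the part shape described in this review; they are
// assumptions for illustration and are not imported from any SDK.
interface InlineVideoPart {
  inlineData: { data: string; mimeType: string };
  videoMetadata?: { fps: number };
}

function buildInlineVideoPart(base64Data: string, mimeType: string, fps?: number): InlineVideoPart {
  const part: InlineVideoPart = { inlineData: { data: base64Data, mimeType } };
  // Attach metadata only when the caller supplies an fps value; reject nonsensical ones.
  if (fps !== undefined) {
    if (!isFinite(fps) || fps <= 0) throw new Error("fps must be a positive, finite number");
    part.videoMetadata = { fps };
  }
  return part;
}
```

Leaving `videoMetadata` off when no fps is given keeps the request identical to the current behavior, so the setting stays opt-in.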
I noticed that PDF and video processing might be the same. Can we add support for using Gemini with multimodal reading of PDFs? It seems that not too many modifications are needed. #7266
Related GitHub Issue
Closes: #6144
Big thanks to jordanrendric for helping by providing working proof of concept that it's possible.
Roo Code Task Context (Optional)
Description
This pull request introduces support for video content processing for Gemini models. The changes generalize the media handling in the chat UI to support both images and videos, and update the data transformation logic to correctly format video content for the Gemini API.
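As a rough illustration of the transformation described here, the sketch below converts simplified content blocks into Gemini-style parts and places media before text, which is the ordering this PR says it applies. The block and part shapes and the `toGeminiParts` name are simplified assumptions, not the actual types in gemini-format.ts.

```typescript
// Simplified, hypothetical block and part shapes; the real types in the PR may differ.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image" | "video"; source: { media_type: string; data: string } };

type GeminiPart = { text: string } | { inlineData: { data: string; mimeType: string } };

function toGeminiParts(blocks: ContentBlock[]): GeminiPart[] {
  const media: GeminiPart[] = [];
  const text: GeminiPart[] = [];
  for (const block of blocks) {
    if (block.type === "text") {
      text.push({ text: block.text });
    } else {
      media.push({ inlineData: { data: block.source.data, mimeType: block.source.media_type } });
    }
  }
  // Media first, then text; relative order within each group is preserved.
  return media.concat(text);
}
```

Collecting into two arrays rather than sorting in place keeps the original order inside each group stable.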
Key implementation details:
- The gemini-format.ts transformer has been extended to process video content blocks, ensuring they are correctly converted to the Gemini API format. It also now sorts content parts to place media before text.
- Replaced the selectedImages state with selectedMedia in ChatView.tsx, ChatRow.tsx, and ChatTextArea.tsx.
- MediaThumbnails component to display thumbnails for different media types (images and videos).
- ChatView.tsx now dynamically determines the accepted file types based on the selected model's capabilities, enabling video formats like MP4, MOV, etc., for supported Gemini models.
- A getMimeType utility function was added to reliably identify media types from data URIs.
Test Procedure
The changes have been tested through both automated unit tests and manual verification.
Unit Tests:
- gemini-format.spec.ts to cover video and mixed-media content transformations.
- src directory: npx vitest run api/transform/__tests__/gemini-format.spec.ts
Manual Testing:
Reviewers can verify the changes by following these steps:
- [...] gemini-2.5-pro).
- [...] .mp4 or .mov file) into the chat text area.
Pre-Submission Checklist
Screenshots / Videos
Documentation Updates
Additional Notes
This change paves the way for supporting more multimodal inputs in the future.
Get in Touch
Important
Enable video processing for Gemini models by updating data transformation, UI components, and utilities to support video content alongside images.
- convertAnthropicContentToGemini in gemini-format.ts to process video content blocks, validating MIME types and base64 format.
- ChatView.tsx and ChatTextArea.tsx to handle video files, replacing selectedImages with selectedMedia.
- getAcceptedFileTypes in media-config.ts to determine supported media types based on model ID.
- MediaThumbnails component in Thumbnails.tsx to display video and image thumbnails.
- ChatRow.tsx and ChatTextArea.tsx to support dynamic media handling.
- getMimeType, isVideoMimeType, and isImageMimeType in media.ts for MIME type handling.
- process-images.ts to use getMimeType for file type validation.
- gemini-format.spec.ts for video content transformation.
- ChatTextArea.spec.tsx to test media handling and prompt history navigation.
This description was created by
for d07c23a315f0f0bea23ef9f3014d197e84ccb2e9.
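The MIME helpers named in this summary could look roughly like the sketch below. These are hypothetical implementations written for illustration; the actual code in media.ts may differ in signature and edge-case handling.

```typescript
// Hypothetical implementations; the PR's media.ts may differ in details.
function getMimeType(dataUri: string): string | null {
  // A data URI looks like "data:<mime>[;param...][;base64],<payload>".
  const match = /^data:([^;,]+)[;,]/.exec(dataUri);
  return match ? match[1].toLowerCase() : null;
}

function isVideoMimeType(mimeType: string | null): boolean {
  return mimeType !== null && mimeType.indexOf("video/") === 0;
}

function isImageMimeType(mimeType: string | null): boolean {
  return mimeType !== null && mimeType.indexOf("image/") === 0;
}
```

Returning `null` for anything that is not a data URI with an explicit MIME type lets callers reject the input instead of guessing a default.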