feat(transform, chat, gemini, media): Gemini enable video processing #6150

VooDisss · 2025-07-24T05:10:27Z

Related GitHub Issue

Closes: #6144
Big thanks to jordanrendric for helping by providing a working proof of concept showing that this is possible.

Roo Code Task Context (Optional)

Description

This pull request introduces support for video content processing for Gemini models. The changes generalize the media handling in the chat UI to support both images and videos, and update the data transformation logic to correctly format video content for the Gemini API.

Key implementation details:

Gemini Transformer Update: The gemini-format.ts transformer has been extended to process video content blocks, ensuring they are correctly converted to the Gemini API format. It also now sorts content parts to place media before text.
Generalized Media UI: The chat interface has been refactored to handle generic media types instead of just images. This includes:
- Replacing selectedImages state with selectedMedia in ChatView.tsx, ChatRow.tsx, and ChatTextArea.tsx.
- Introducing a new MediaThumbnails component to display thumbnails for different media types (images and videos).
Dynamic File Type Acceptance: ChatView.tsx now dynamically determines the accepted file types based on the selected model's capabilities, enabling video formats like MP4, MOV, etc., for supported Gemini models.
MIME Type Utility: A new getMimeType utility function was added to reliably identify media types from data URIs.
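
For reviewers who want a concrete picture of the MIME type utility, here is a minimal sketch of how a helper like getMimeType could read the media type from a data URI. The function names match those listed in this PR, but the bodies are an illustrative assumption, not the code in media.ts.

```typescript
// Hypothetical sketch; the actual helpers in media.ts may differ.

/** Extracts the MIME type from a data URI such as "data:video/mp4;base64,AAAA". */
function getMimeType(dataUri: string): string | null {
	// A data URI is "data:" + media type + optional parameters + "," + payload.
	const match = /^data:([^;,]+)[;,]/i.exec(dataUri)
	return match?.[1]?.toLowerCase() ?? null
}

/** True when the MIME type belongs to the video/* family. */
function isVideoMimeType(mimeType: string | null): boolean {
	return mimeType !== null && mimeType.startsWith("video/")
}

/** True when the MIME type belongs to the image/* family. */
function isImageMimeType(mimeType: string | null): boolean {
	return mimeType !== null && mimeType.startsWith("image/")
}
```

Returning null for anything that is not a data URI lets callers reject unknown input instead of guessing a type.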

Test Procedure

The changes have been tested through both automated unit tests and manual verification.

Unit Tests:

New unit tests have been added to gemini-format.spec.ts to cover video and mixed-media content transformations.
To run the tests, execute the following command from the src directory: npx vitest run api/transform/__tests__/gemini-format.spec.ts

Manual Testing:
Reviewers can verify the changes by following these steps:

Select a Gemini model that supports video input (e.g., gemini-2.5-pro).
Drag and drop or paste a video file (e.g., an .mp4 or .mov file) into the chat text area.
Verify that a video thumbnail appears in the composer.
Send a message containing the video.
Verify that the message is sent and the model processes the video content in its response.
Test with other media combinations (e.g., images, text and video) to ensure they are handled correctly.

Pre-Submission Checklist

Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
Scope: My changes are focused on the linked issue (one major feature/fix per PR).
Self-Review: I have performed a thorough self-review of my code.
Testing: New and/or updated tests have been added to cover my changes (if applicable).
Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

Documentation Updates

No documentation updates are required.

Additional Notes

This change paves the way for supporting more multimodal inputs in the future.

Get in Touch

Important

Enable video processing for Gemini models by updating data transformation, UI components, and utilities to support video content alongside images.

Behavior:
- Extend convertAnthropicContentToGemini in gemini-format.ts to process video content blocks, validating MIME types and base64 format.
- Update ChatView.tsx and ChatTextArea.tsx to handle video files, replacing selectedImages with selectedMedia.
- Add getAcceptedFileTypes in media-config.ts to determine supported media types based on model ID.
UI Components:
- Introduce MediaThumbnails component in Thumbnails.tsx to display video and image thumbnails.
- Update ChatRow.tsx and ChatTextArea.tsx to support dynamic media handling.
Utilities:
- Add getMimeType, isVideoMimeType, and isImageMimeType in media.ts for MIME type handling.
- Refactor process-images.ts to use getMimeType for file type validation.
Tests:
- Add tests in gemini-format.spec.ts for video content transformation.
- Update ChatTextArea.spec.tsx to test media handling and prompt history navigation.
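
The transformer behavior summarized above (video blocks converted to inline data, media placed before text) can be sketched roughly as follows. The block and part types are simplified stand-ins declared locally for illustration, and toGeminiParts is a hypothetical name; the real convertAnthropicContentToGemini handles more block types and validation.

```typescript
// Simplified, locally declared types; assumptions for illustration only.
type TextBlock = { type: "text"; text: string }
type MediaBlock = {
	type: "image" | "video"
	source: { type: "base64"; media_type: string; data: string }
}
type ContentBlock = TextBlock | MediaBlock

type GeminiPart = { text: string } | { inlineData: { mimeType: string; data: string } }

/** Converts content blocks to Gemini-style parts, placing media before text. */
function toGeminiParts(blocks: ContentBlock[]): GeminiPart[] {
	const media: GeminiPart[] = []
	const text: GeminiPart[] = []
	for (const block of blocks) {
		if (block.type === "text") {
			text.push({ text: block.text })
		} else {
			// Images and videos both become inline data parts.
			media.push({ inlineData: { mimeType: block.source.media_type, data: block.source.data } })
		}
	}
	// Media first, then text; relative order within each group is preserved.
	return [...media, ...text]
}
```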

This description was created by the ellipsis-dev bot for d07c23a315f0f0bea23ef9f3014d197e84ccb2e9. It will automatically update as commits are pushed.

- Consolidate duplicate getMimeType functions into shared utilities
- Remove duplicate MediaThumbnails component, enhance Thumbnails to support video
- Add JSDoc comments to VideoContentBlock interface
- Convert inline styles to Tailwind classes in ChatRow
- Add robust error handling for video processing
- Create centralized media configuration for accepted file types
- Ensure consistent test naming conventions
- Fix ESLint warnings

hannesrudolph · 2025-07-24T06:09:41Z

Hi @VooDisss,

I've addressed all the review feedback in a new branch pr-6150. The changes include:

✅ Consolidated duplicate getMimeType functions into shared utilities
✅ Removed duplicate MediaThumbnails component and enhanced Thumbnails to support video
✅ Added JSDoc comments to the VideoContentBlock interface
✅ Converted inline styles to Tailwind classes in ChatRow
✅ Added robust error handling for video processing
✅ Created centralized media configuration for accepted file types
✅ Fixed all ESLint warnings

You can view the changes at: https://github.com/RooCodeInc/Roo-Code/tree/pr-6150

Since this PR is from your fork, you'll need to either:

Cherry-pick the commits from the pr-6150 branch into your fork's main branch
Or close this PR and I can create a new one with all the changes

Thank you for your contribution!

VooDisss · 2025-07-24T15:25:53Z

@hannesrudolph thank you for your edits in your pr-6150.

I ran gh pr checkout 6150, compiled the extension, and it works. Your edits even fixed some parts of the UI, thank you!
I'm attaching a .gif showing that it works:

daniel-lxs

Hey @VooDisss, Thank you for your contribution! I left some suggestions to make the implementation more robust, let me know what you think or if you have any questions.

daniel-lxs · 2025-07-28T21:04:44Z

src/api/transform/gemini-format.ts

The base64 validation regex /^[A-Za-z0-9+/]*={0,2}$/ might be too permissive - it would accept an empty string as valid base64. This could cause issues when the Gemini API tries to process empty video data.

Could we add a minimum length check? For example:

if (!base64Regex.test(block.source.data.replace(/\s/g, "")) || block.source.data.trim().length < 4) {
	throw new Error("Invalid or empty base64 format for video data")
}

Or perhaps use a more robust validation approach like attempting to decode a small portion of the base64 string?
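
As one possible shape for the stricter check, here is a minimal sketch that rejects empty input and lengths that are not a multiple of 4 before applying the character-class test. It assumes the data is standard, padded base64; isValidBase64 is a hypothetical helper name, not something that exists in the PR.

```typescript
/**
 * Hypothetical helper: validates standard, padded base64.
 * Whitespace is ignored; empty input is rejected.
 */
function isValidBase64(data: string): boolean {
	const stripped = data.replace(/\s/g, "")
	// Padded base64 always has a non-zero length that is a multiple of 4.
	if (stripped.length === 0 || stripped.length % 4 !== 0) {
		return false
	}
	// "+" instead of "*" so at least one payload character is required.
	return /^[A-Za-z0-9+/]+={0,2}$/.test(stripped)
}
```

If unpadded base64 must also be accepted, the length check would need to be relaxed accordingly.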

daniel-lxs · 2025-07-28T21:05:09Z

src/api/transform/gemini-format.ts

The supported video formats are hard-coded here and also in webview-ui/src/utils/media-config.ts. This duplication could lead to maintenance issues if formats need to be added or removed.

I think we can move these to packages/types so they can be imported from @roo-code/types in both the frontend and backend.
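
A rough sketch of what a single shared definition could look like is below. The constant name, the type guard, and the two listed formats are illustrative assumptions; the actual list should be whatever gemini-format.ts and media-config.ts currently hard-code.

```typescript
// Hypothetical shared module, e.g. exported from packages/types.
// The entries are examples only, not the full list used in the PR.
const SUPPORTED_VIDEO_MIME_TYPES = ["video/mp4", "video/quicktime"] as const

type SupportedVideoMimeType = (typeof SUPPORTED_VIDEO_MIME_TYPES)[number]

/** Type guard usable by both the transformer and the webview. */
function isSupportedVideoMimeType(mimeType: string): mimeType is SupportedVideoMimeType {
	return (SUPPORTED_VIDEO_MIME_TYPES as readonly string[]).includes(mimeType)
}
```

With one source of truth, adding or removing a format becomes a single-line change for both the frontend and the backend.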

daniel-lxs · 2025-07-28T21:05:19Z

webview-ui/src/components/common/Thumbnails.tsx

The video thumbnail rendering doesn't have error handling like the image rendering does (lines 89-99). It might be a good idea to add some error handling to it.

daniel-lxs · 2025-07-28T21:05:30Z

webview-ui/src/components/common/Thumbnails.tsx

The video thumbnail could benefit from better accessibility. Currently, the title only shows the MIME type.

Consider adding:

An aria-label that describes this as a video file

The file size if available

Video duration if that information is accessible

Example:

aria-label={`Video file: ${mimeType || 'Unknown format'}`} role="img"
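
To fold in file size and duration when they are known, the label text could be built by a small pure function like the sketch below. The VideoThumbnailInfo shape and buildVideoAriaLabel are hypothetical names; whether size and duration are actually available in Thumbnails.tsx is an open question.

```typescript
// Hypothetical input shape; every field is optional because the data may be missing.
interface VideoThumbnailInfo {
	mimeType?: string
	sizeBytes?: number
	durationSeconds?: number
}

/** Builds a descriptive aria-label for a video thumbnail. */
function buildVideoAriaLabel(info: VideoThumbnailInfo): string {
	const parts = [`Video file: ${info.mimeType || "Unknown format"}`]
	if (info.sizeBytes !== undefined) {
		parts.push(`${(info.sizeBytes / (1024 * 1024)).toFixed(1)} MB`)
	}
	if (info.durationSeconds !== undefined) {
		parts.push(`${Math.round(info.durationSeconds)} seconds`)
	}
	return parts.join(", ")
}
```

Keeping this outside the component makes the wording easy to unit test and, later, to route through the translation function.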

daniel-lxs · 2025-07-28T21:40:24Z

Closing for now

nnWhisperer · 2025-08-03T03:10:58Z

That would be a nice feature, I hope that it will be merged to the mainline soon, thank you for your contributions

nnWhisperer · 2025-08-03T16:39:39Z

Hello, while you are at it, could you add an fps parameter? By default, Gemini samples videos at 1 frame per second, which is too low in some cases, for example when showing a flickering, buggy screen, a common reason to use video during development. Here is the reference. Another reference is the video metadata type in the Gemini JS library.

VooDisss · 2025-08-03T22:34:06Z

@nnWhisperer, if you want to use this Gemini video support feature and you want it now, I suggest you compile your own .vsix of the extension from https://github.com/VooDisss/Roo-Code/tree/main

It does not use File API, so it is not limited to 1fps analysis. It converts your video into base64 and sends it together with the prompt (in text format) in the HTTP API request.

Issues Noticed (using it for the past 2 weeks):
I have noticed 2 issues so far:

LLM Switching Problem:
When you use a Gemini model and upload a video, and then switch to Horizon Beta (or possibly some other model), the request fails to send (400 Failed to extract 1 image(s)). You are therefore locked into using Gemini for that chat.
- Workaround: Use orchestrator mode and invoke a sub-task in which you send the video through Gemini. The sub-task responds with a full analysis when it finishes via attempt_complete, which pushes that context into the main chat.
Video Size Limitation:
The video size is quite limited. In practice it does not accept videos anywhere near 20MB, even though the Gemini documentation says it accepts requests below that size (including the whole context that is sent with the video).
- My Solution: When it fails, I use Handbrake to compress my videos with HEVC H265 NVENC to about 4MB, and sometimes to 1MB, so that it accepts them. (I think it should accept much bigger videos, but I'm scared to touch the code while it still fulfills my current needs.)

Development Status:
Basically, I pulled it to about 70% done, @hannesrudolph pulled it forward to about 80-90% done, and there is little more work needed to figure out these bugs...

Regarding Official Documentation:
Documentation says that one needs to use Gemini File API if the sent HTTP API request is >=20MB, but then it becomes limited to 1 fps as far as I read in your linked documentation (good job! I did not notice that).

nnWhisperer · 2025-08-04T00:47:21Z

Thank you for the compliment.
I suggest we break the task down and not try to support all video models at once. Gemini has a good price/performance ratio AFAIK. I'm not very familiar with other models (like the Horizon Beta you mentioned). Let's focus on Gemini's API for this one; it should be OK not to be able to switch models in the first version of video support.
Sending the videos inside the messages would not be good, because they will be sent again with every following message. They are not single-use. The advantage of the Files API is that the IDE does not have to send the video with every chat message. Without the Files API, using for example base64 for a 10MB video, that 10+MB of base64 is sent again with every subsequent message, which would likely make the interaction slow.
OpenAI doesn't support videos (or may have added support only recently), so we can expect them to follow Google's lead and provide something akin to the Files API in the near future. Otherwise megabytes of video data will be transferred redundantly between users and endpoints.
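
To put rough numbers on the resending cost, here is a small arithmetic sketch. It assumes the inline video stays in the conversation history and is therefore included in every request; the function names are made up for illustration.

```typescript
/** Length of the padded base64 encoding of `byteCount` raw bytes. */
function base64Length(byteCount: number): number {
	// Every 3 input bytes become 4 output characters, rounded up to a full group.
	return 4 * Math.ceil(byteCount / 3)
}

/** Total base64 characters uploaded when the same inline video rides along with every request. */
function cumulativeInlineUpload(videoBytes: number, messageCount: number): number {
	return base64Length(videoBytes) * messageCount
}

const tenMegabytes = 10 * 1024 * 1024
console.log(base64Length(tenMegabytes)) // 13981016 characters, about 13.3MB per request
console.log(cumulativeInlineUpload(tenMegabytes, 5)) // 69905080 characters over five messages
```

With a one-time upload and a file reference, the video payload would be sent once regardless of how many messages follow.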

VooDisss · 2025-08-04T00:58:40Z

@nnWhisperer Below is an attached video, which increased the context by 9.2k tokens, just for your reference...

You raised a good point about why the Files API should be preferred (although the video would then have to be uploaded every time, instead of staying in context for several messages). I think there should instead be a way to remove particular messages or videos from the chat context (but that would be another issue). Also, I just tried uploading a 41-second, 11.4MB video and it failed.

nnWhisperer · 2025-08-04T10:05:27Z

With the Files API, the video is uploaded once and a reference to the file is used afterwards. It isn't uploaded with every message, which is where the speed advantage comes from.

daniel-lxs

Hey @VooDisss I took a look again, there are some points from my previous review that might need addressing.

Overall I think the idea and implementation are solid. Thank you, and sorry for the delay!

daniel-lxs · 2025-08-04T14:40:52Z

webview-ui/src/utils/media-config.ts

The hardcoded list of Gemini model IDs here is quite long and will require manual updates whenever new models are added. Consider using a pattern matching approach instead:

Suggested change

- const isGeminiWithVideo =
+ const isGeminiWithVideo = modelId?.includes("gemini-") &&
+   (modelId.includes("-pro") || modelId.includes("-flash"));

Or better yet, this information could come from the model configuration itself.
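
Both options can be combined: prefer a capability flag from the model configuration and fall back to the pattern match only when the flag is absent. The sketch below assumes a hypothetical supportsVideo field and illustrative extension lists; neither is taken from the actual media-config.ts.

```typescript
// Hypothetical capability shape; `supportsVideo` is an assumed field name.
interface ModelCapabilities {
	supportsVideo?: boolean
}

// Illustrative extension lists only.
const IMAGE_FILE_TYPES = ["png", "jpeg", "webp"]
const VIDEO_FILE_TYPES = ["mp4", "mov"]

/** Pattern-based fallback mirroring the suggestion above. */
function isGeminiWithVideo(modelId?: string): boolean {
	return !!modelId && modelId.includes("gemini-") && (modelId.includes("-pro") || modelId.includes("-flash"))
}

/** Prefers the explicit capability flag and falls back to the model ID pattern. */
function getAcceptedFileTypes(modelId?: string, capabilities?: ModelCapabilities): string[] {
	const videoOk = capabilities?.supportsVideo ?? isGeminiWithVideo(modelId)
	return videoOk ? [...IMAGE_FILE_TYPES, ...VIDEO_FILE_TYPES] : [...IMAGE_FILE_TYPES]
}
```

Note that the substring check matches any ID containing those fragments, whether or not that model accepts video, which is one more argument for the configuration-driven flag.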

daniel-lxs · 2025-08-04T14:40:53Z

webview-ui/src/components/chat/ChatView.tsx

The constant name MAX_IMAGES_PER_MESSAGE is now misleading since it applies to both images and videos. Consider renaming to MAX_MEDIA_PER_MESSAGE to better reflect its purpose.

daniel-lxs · 2025-08-04T14:40:53Z

webview-ui/src/components/chat/ChatTextArea.tsx

The warning message still references images when it should be more generic for media:

Suggested change

- console.warn(t("chat:noValidImages"))
+ console.warn(t("chat:noValidMedia"))

You'll also need to update the corresponding translation key.

daniel-lxs · 2025-08-04T14:40:53Z

src/api/transform/gemini-format.ts

Based on the PR discussion, there's a < 20MB limitation for video files when using the HTTP API. Consider adding validation here to provide better error messages:

Suggested change

- if (!block.source.data || block.source.data.trim() === "") {
+ // Check if video data exists
+ if (!block.source.data || block.source.data.trim() === "") {
+   throw new Error("Video data is empty or missing")
+ }
+ // Rough estimate: base64 is ~1.33x the original size
+ const estimatedSize = (block.source.data.length * 0.75) / (1024 * 1024); // in MB
+ if (estimatedSize > 20) {
+   throw new Error(`Video size (~${estimatedSize.toFixed(1)}MB) exceeds the 20MB limit for direct API calls. Consider using the Files API for larger videos.`);
+ }
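
If a slightly more exact figure than the 0.75 estimate is wanted, the decoded size can be computed from the base64 length and its padding. This is a sketch with hypothetical helper names. The 20MB default comes from the discussion in this PR, and since that limit reportedly covers the whole request, the practical budget for the video alone is likely smaller.

```typescript
/** Number of raw bytes represented by a padded base64 string (whitespace ignored). */
function decodedBase64Bytes(base64: string): number {
	const stripped = base64.replace(/\s/g, "")
	if (stripped.length === 0) {
		return 0
	}
	// Each "=" at the end stands for one missing byte in the final 3-byte group.
	const padding = stripped.endsWith("==") ? 2 : stripped.endsWith("=") ? 1 : 0
	return Math.floor((stripped.length * 3) / 4) - padding
}

// Assumed limit, taken from the PR discussion.
const MAX_INLINE_VIDEO_BYTES = 20 * 1024 * 1024

/** Throws a descriptive error when the decoded video exceeds the given limit. */
function assertInlineVideoSize(base64: string, maxBytes: number = MAX_INLINE_VIDEO_BYTES): void {
	const bytes = decodedBase64Bytes(base64)
	if (bytes > maxBytes) {
		const megabytes = (bytes / (1024 * 1024)).toFixed(1)
		throw new Error(`Video size (~${megabytes}MB) exceeds the limit for direct API calls.`)
	}
}
```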

nnWhisperer

You can add the video metadata and fps parameter in this pull request; I've marked the place to add it in the review comments here. They are supported even when the video is embedded inline in the message. For example, here is Python code that uses it:

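# Only for videos of size <20Mb
from google import genai
from google.genai import types

# Assumes the API key is provided through the environment (e.g. GEMINI_API_KEY).
client = genai.Client()

video_file_name = "/path/to/your/video.mp4"
with open(video_file_name, 'rb') as f:
    video_bytes = f.read()

response = client.models.generate_content(
    model='models/gemini-2.5-flash',
    contents=types.Content(
        parts=[
            types.Part(
                inline_data=types.Blob(
                    data=video_bytes,
                    mime_type='video/mp4'),
                # Sample 5 frames per second instead of the default 1 fps.
                video_metadata=types.VideoMetadata(fps=5)
            ),
            types.Part(text='Please summarize the video in 3 sentences.')
        ]
    )
)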

nnWhisperer · 2025-08-03T16:47:35Z

src/api/transform/gemini-format.ts

here you can add a metadata parameter, after the inline parameter, something similar to this:
video_metadata=types.VideoMetadata(fps=5), where 5 should be configurable

https://github.com/googleapis/js-genai/blob/9841ecb359d57648e284271fdf3a477ca3c5d6f1/src/types.ts#L928
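
On the TypeScript side, the equivalent could look roughly like the sketch below. The part shape is declared locally rather than imported, and the camelCase field names (inlineData, videoMetadata, fps) are assumed to correspond to the Python example and the linked types file, so they should be checked against the SDK version in use. buildVideoPart is a hypothetical helper name.

```typescript
// Locally declared shape; assumed to mirror the SDK's part type for inline video.
interface InlineVideoPart {
	inlineData: { mimeType: string; data: string }
	videoMetadata?: { fps: number }
}

/** Builds an inline video part, attaching an fps hint only when one is configured. */
function buildVideoPart(base64Data: string, mimeType: string, fps?: number): InlineVideoPart {
	const part: InlineVideoPart = { inlineData: { mimeType, data: base64Data } }
	if (fps !== undefined) {
		if (!Number.isFinite(fps) || fps <= 0) {
			throw new Error(`Invalid fps value: ${fps}`)
		}
		part.videoMetadata = { fps }
	}
	return part
}
```

Leaving videoMetadata out when no fps is configured keeps the default sampling behavior unchanged for existing users.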


isCopyman · 2025-09-15T02:39:47Z

I noticed that PDF and video processing might be the same. Can we add support for using Gemini with multimodal reading of PDFs? It seems that not too many modifications are needed. #7266

VooDisss requested review from cte, jr and mrubens as code owners July 24, 2025 05:10

github-project-automation bot added this to Roo Code Roadmap Jul 24, 2025

github-project-automation bot moved this to New in Roo Code Roadmap Jul 24, 2025

github-project-automation bot added this to Roo Code Roadmap Jul 24, 2025

github-project-automation bot moved this to Triage in Roo Code Roadmap Jul 24, 2025

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Jul 24, 2025

hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Jul 24, 2025

dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 24, 2025

daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Jul 24, 2025

daniel-lxs mentioned this pull request Jul 24, 2025

feat: enable video uploads for Gemini 2.5 Pro models #6145

Closed

hannesrudolph added PR - Needs Preliminary Review and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Jul 24, 2025

daniel-lxs reviewed Jul 28, 2025


daniel-lxs moved this from PR [Needs Prelim Review] to PR [Changes Requested] in Roo Code Roadmap Jul 28, 2025

hannesrudolph added PR - Changes Requested and removed PR - Needs Preliminary Review labels Jul 28, 2025

daniel-lxs closed this Jul 28, 2025

github-project-automation bot moved this from New to Done in Roo Code Roadmap Jul 28, 2025

github-project-automation bot moved this from PR [Changes Requested] to Done in Roo Code Roadmap Jul 28, 2025

daniel-lxs reopened this Aug 1, 2025

github-project-automation bot moved this from Done to Triage in Roo Code Roadmap Aug 1, 2025

github-project-automation bot moved this from Done to New in Roo Code Roadmap Aug 1, 2025

daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Aug 1, 2025

hannesrudolph added PR - Needs Preliminary Review and removed PR - Changes Requested labels Aug 1, 2025

daniel-lxs reviewed Aug 4, 2025


daniel-lxs moved this from PR [Needs Prelim Review] to PR [Changes Requested] in Roo Code Roadmap Aug 4, 2025

hannesrudolph added PR - Changes Requested and removed PR - Needs Preliminary Review labels Aug 4, 2025

nnWhisperer suggested changes Aug 4, 2025


VooDisss closed this Aug 16, 2025

VooDisss force-pushed the main branch from d07c23a to 52c58ea Compare August 16, 2025 02:52

github-project-automation bot moved this from PR [Changes Requested] to Done in Roo Code Roadmap Aug 16, 2025

github-project-automation bot moved this from New to Done in Roo Code Roadmap Aug 16, 2025

dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Aug 16, 2025

