@VooDisss VooDisss commented Jul 24, 2025

Related GitHub Issue

Closes: #6144
Big thanks to jordanrendric for helping by providing a working proof of concept showing that this is possible.

Roo Code Task Context (Optional)

Description

This pull request introduces support for video content processing for Gemini models. The changes generalize the media handling in the chat UI to support both images and videos, and update the data transformation logic to correctly format video content for the Gemini API.

Key implementation details:

  • Gemini Transformer Update: The gemini-format.ts transformer has been extended to process video content blocks, ensuring they are correctly converted to the Gemini API format. It also now sorts content parts to place media before text.
  • Generalized Media UI: The chat interface has been refactored to handle generic media types instead of just images. This includes:
  • Dynamic File Type Acceptance: ChatView.tsx now dynamically determines the accepted file types based on the selected model's capabilities, enabling video formats like MP4, MOV, etc., for supported Gemini models.
  • MIME Type Utility: A new getMimeType utility function was added to reliably identify media types from data URIs.
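For readers unfamiliar with data URIs, here is a minimal sketch of how a utility like getMimeType could extract the media type. The function names match those mentioned in this PR, but the bodies are assumptions for illustration, not the PR's actual code.

```typescript
// Hypothetical sketch: the PR's real getMimeType may differ.
// A data URI has the shape "data:<mime>[;param][;base64],<payload>".
function getMimeType(dataUri: string): string | undefined {
  const match = /^data:([^;,]+)[;,]/i.exec(dataUri)
  // MIME types are case-insensitive, so normalise to lower case.
  return match?.[1]?.toLowerCase()
}

function isVideoMimeType(mimeType: string | undefined): boolean {
  return mimeType !== undefined && mimeType.startsWith("video/")
}

function isImageMimeType(mimeType: string | undefined): boolean {
  return mimeType !== undefined && mimeType.startsWith("image/")
}
```

A caller would typically branch on isVideoMimeType(getMimeType(uri)) to decide whether to render a video or an image thumbnail.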

Test Procedure

The changes have been tested through both automated unit tests and manual verification.

Unit Tests:

  • New unit tests have been added to gemini-format.spec.ts to cover video and mixed-media content transformations.
  • To run the tests, execute the following command from the src directory: npx vitest run api/transform/__tests__/gemini-format.spec.ts

Manual Testing:
Reviewers can verify the changes by following these steps:

  1. Select a Gemini model that supports video input (e.g., gemini-2.5-pro).
  2. Drag and drop or paste a video file (e.g., an .mp4 or .mov file) into the chat text area.
  3. Verify that a video thumbnail appears in the composer.
  4. Send a message containing the video.
  5. Verify that the message is sent and the model processes the video content in its response.
  6. Test with other media combinations (e.g., images, text and video) to ensure they are handled correctly.

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
  • Scope: My changes are focused on the linked issue (one major feature/fix per PR).
  • Self-Review: I have performed a thorough self-review of my code.
  • Testing: New and/or updated tests have been added to cover my changes (if applicable).
  • Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
  • Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

Code_24_17_29_20___8306

Documentation Updates

  • No documentation updates are required.

Additional Notes

This change paves the way for supporting more multimodal inputs in the future.

Get in Touch


Important

Enable video processing for Gemini models by updating data transformation, UI components, and utilities to support video content alongside images.

  • Behavior:
    • Extend convertAnthropicContentToGemini in gemini-format.ts to process video content blocks, validating MIME types and base64 format.
    • Update ChatView.tsx and ChatTextArea.tsx to handle video files, replacing selectedImages with selectedMedia.
    • Add getAcceptedFileTypes in media-config.ts to determine supported media types based on model ID.
  • UI Components:
    • Introduce MediaThumbnails component in Thumbnails.tsx to display video and image thumbnails.
    • Update ChatRow.tsx and ChatTextArea.tsx to support dynamic media handling.
  • Utilities:
    • Add getMimeType, isVideoMimeType, and isImageMimeType in media.ts for MIME type handling.
    • Refactor process-images.ts to use getMimeType for file type validation.
  • Tests:
    • Add tests in gemini-format.spec.ts for video content transformation.
    • Update ChatTextArea.spec.tsx to test media handling and prompt history navigation.
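To illustrate the transformation described above, the following is a minimal sketch of converting content blocks to Gemini-style parts with media placed before text. The block and part shapes are simplified assumptions (the video block shape in particular is hypothetical), and toGeminiParts is an illustrative name, not the PR's convertAnthropicContentToGemini.

```typescript
// Simplified, assumed shapes; the real types come from the respective SDKs.
type TextBlock = { type: "text"; text: string }
type MediaBlock = {
  type: "image" | "video"
  source: { type: "base64"; media_type: string; data: string }
}
type ContentBlock = TextBlock | MediaBlock

type GeminiPart = { text: string } | { inlineData: { mimeType: string; data: string } }

function toGeminiParts(blocks: ContentBlock[]): GeminiPart[] {
  const parts = blocks.map((block): GeminiPart =>
    block.type === "text"
      ? { text: block.text }
      : { inlineData: { mimeType: block.source.media_type, data: block.source.data } },
  )
  // Stable partition: media first, then text, keeping relative order in each group.
  const media = parts.filter((part) => "inlineData" in part)
  const text = parts.filter((part) => "text" in part)
  return [...media, ...text]
}
```

Partitioning with two filters keeps the original order within each group, which a plain sort with a comparator does not make as obvious to the reader.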

This description was created by Ellipsis for d07c23a315f0f0bea23ef9f3014d197e84ccb2e9. You can customize this summary. It will automatically update as commits are pushed.

@VooDisss VooDisss requested review from cte, jr and mrubens as code owners July 24, 2025 05:10
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Jul 24, 2025
@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Jul 24, 2025
hannesrudolph added a commit that referenced this pull request Jul 24, 2025
- Consolidate duplicate getMimeType functions into shared utilities
- Remove duplicate MediaThumbnails component, enhance Thumbnails to support video
- Add JSDoc comments to VideoContentBlock interface
- Convert inline styles to Tailwind classes in ChatRow
- Add robust error handling for video processing
- Create centralized media configuration for accepted file types
- Ensure consistent test naming conventions
- Fix ESLint warnings
@hannesrudolph
Collaborator

Hi @VooDisss,

I've addressed all the review feedback in a new branch pr-6150. The changes include:

  1. ✅ Consolidated duplicate getMimeType functions into shared utilities
  2. ✅ Removed duplicate MediaThumbnails component and enhanced Thumbnails to support video
  3. ✅ Added JSDoc comments to the VideoContentBlock interface
  4. ✅ Converted inline styles to Tailwind classes in ChatRow
  5. ✅ Added robust error handling for video processing
  6. ✅ Created centralized media configuration for accepted file types
  7. ✅ Fixed all ESLint warnings

You can view the changes at: https://github.com/RooCodeInc/Roo-Code/tree/pr-6150

Since this PR is from your fork, you'll need to either:

  • Cherry-pick the commits from the pr-6150 branch into your fork's main branch
  • Or close this PR and I can create a new one with all the changes

Thank you for your contribution!

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 24, 2025
@VooDisss
Author

@hannesrudolph thank you for your edits in your pr-6150.

I ran gh pr checkout 6150 and compiled the extension, and it works. Your edits even fixed some parts of the UI, thank you!
I'm attaching a .gif showing that it works:

Code_24_17_29_20___8306

@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Jul 24, 2025
@hannesrudolph hannesrudolph added PR - Needs Preliminary Review and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Jul 24, 2025
Member

@daniel-lxs daniel-lxs left a comment

Hey @VooDisss, Thank you for your contribution! I left some suggestions to make the implementation more robust, let me know what you think or if you have any questions.

Member

The base64 validation regex /^[A-Za-z0-9+/]*={0,2}$/ might be too permissive - it would accept an empty string as valid base64. This could cause issues when the Gemini API tries to process empty video data.

Could we add a minimum length check? For example:

// Strip whitespace once so the regex test and the length check see the same string
const videoData = block.source.data.replace(/\s/g, "")
if (videoData.length < 4 || !base64Regex.test(videoData)) {
    throw new Error("Invalid or empty base64 format for video data")
}

Or perhaps use a more robust validation approach like attempting to decode a small portion of the base64 string?
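One possible shape for a stricter check, as a minimal sketch: it assumes standard padded base64 (unpadded or URL-safe variants would be rejected), and isValidBase64 is a hypothetical helper name rather than anything in the PR.

```typescript
// Hypothetical helper: accepts only non-empty, padded, standard-alphabet base64.
function isValidBase64(input: string): boolean {
  // Ignore line breaks and spaces that some encoders insert.
  const data = input.replace(/\s/g, "")
  // Padded base64 is never empty here and always a multiple of 4 characters long.
  if (data.length === 0 || data.length % 4 !== 0) return false
  // "+" instead of "*" requires at least one data character; "=" may only appear at the end.
  return /^[A-Za-z0-9+/]+={0,2}$/.test(data)
}
```

This stays a purely syntactic check; it does not prove that the decoded bytes form a playable video.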

Member

The supported video formats are hard-coded here and also in webview-ui/src/utils/media-config.ts. This duplication could lead to maintenance issues if formats need to be added or removed.

I think we can move these to packages/types so they can be imported from @roo-code/types in both the frontend and backend.
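A minimal sketch of what such a shared module might look like. The constant name, the helper names, and the exact format list are assumptions for illustration; the PR's real list may contain other formats.

```typescript
// Hypothetical shared definitions, imagined as an export of @roo-code/types.
// The list below is illustrative only.
const SUPPORTED_VIDEO_MIME_TYPES = ["video/mp4", "video/quicktime", "video/webm"] as const
type SupportedVideoMimeType = (typeof SUPPORTED_VIDEO_MIME_TYPES)[number]

// Backend use: validate a block's MIME type before sending it to the API.
function isSupportedVideoMimeType(mimeType: string): mimeType is SupportedVideoMimeType {
  return SUPPORTED_VIDEO_MIME_TYPES.some((supported) => supported === mimeType)
}

// Frontend use: build the value for a file input's accept attribute.
function videoAcceptAttribute(): string {
  return SUPPORTED_VIDEO_MIME_TYPES.join(",")
}
```

With a single source of truth, adding a format means editing one array, and both the validation and the file picker follow.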

Member

The video thumbnail rendering doesn't have error handling like the image rendering does (lines 89-99). It might be a good idea to add some error handling to it.

Member

The video thumbnail could benefit from better accessibility. Currently, the title only shows the MIME type.

Consider adding:

  • An aria-label that describes this as a video file
  • The file size if available
  • Video duration if that information is accessible

Example:

aria-label={`Video file: ${mimeType || 'Unknown format'}`}
role="img"

@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Changes Requested] in Roo Code Roadmap Jul 28, 2025
@daniel-lxs
Member

Closing for now

@daniel-lxs daniel-lxs closed this Jul 28, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Jul 28, 2025
@github-project-automation github-project-automation bot moved this from PR [Changes Requested] to Done in Roo Code Roadmap Jul 28, 2025
@daniel-lxs daniel-lxs reopened this Aug 1, 2025
@github-project-automation github-project-automation bot moved this from Done to Triage in Roo Code Roadmap Aug 1, 2025
@github-project-automation github-project-automation bot moved this from Done to New in Roo Code Roadmap Aug 1, 2025
@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Aug 1, 2025
@nnWhisperer

That would be a nice feature. I hope it will be merged into mainline soon. Thank you for your contributions!

@nnWhisperer

nnWhisperer commented Aug 3, 2025

Hello, while you are at it, could you add an fps parameter? By default, Gemini samples videos at 1 frame per second, which is too low for cases such as a flickering, buggy screen, and those are quite common when videos are used for development. Here is the reference. Another reference to the metadata in the Gemini JS library.

@VooDisss
Author

VooDisss commented Aug 3, 2025

@nnWhisperer, if you want to use this Gemini video support feature and you want it now, I suggest you compile your own .vsix of the extension from https://github.com/VooDisss/Roo-Code/tree/main

It does not use the File API, so it is not limited to 1 fps analysis. It converts your video to base64 and sends it together with the prompt (in text format) in the HTTP API request.

Issues Noticed (using it for the past 2 weeks):
I have noticed 2 issues so far:

  1. LLM Switching Problem:
    When you use a Gemini model and upload a video, and then switch to Horizon Beta (or possibly some other model), the request fails to send (400 Failed to extract 1 image(s)). You are therefore locked into using Gemini for that chat.

    • Workaround: Use orchestrator mode and invoke a sub-task in which you send the video through Gemini. It responds with a full analysis when it calls attempt_complete, which pushes that context into the main chat.
  2. Video Size Limitation:
    It is quite limited in the video size it accepts. In practice it does not handle all videos under 20MB, even though the Gemini documentation says it accepts requests below that size (including the whole context that is sent with the video).

    • My Solution: When it fails, I use Handbrake and compress my videos using HEVC H265 NVENC to about 4MB and sometimes to 1MB size so it accepts. (I think it should accept way bigger videos, but I'm scared to touch the code while it still fulfills my current needs).

Development Status:
Basically, I got it to about 70% done, @hannesrudolph pushed it forward to about 80-90% done, and a little more work is needed to figure out these bugs...

Regarding Official Documentation:
Documentation says that one needs to use Gemini File API if the sent HTTP API request is >=20MB, but then it becomes limited to 1 fps as far as I read in your linked documentation (good job! I did not notice that).

@nnWhisperer

Thank you for the compliment.
I suggest we break the task down and not try to support all video models at once. Gemini has a nice price/performance ratio AFAIK. I'm not very familiar with other models (like the Horizon Beta you mentioned). Let's focus on Gemini's API for this one; it should be OK if the first version of video support does not allow switching models.
Sending the videos inline in the messages would not be good, because they are not single-use: they will be sent again with every following message. The advantage of the Files API is that the IDE does not have to send the video on every chat message. Without the Files API, with for example a 10MB video as base64, every subsequent message carries that 10+MB of base64 again, which would likely make the interaction slow.
OpenAI doesn't support videos (or maybe only recently), so we can expect them to follow Google's steps and provide something akin to the Files API in the near future; otherwise megabytes of video data will be redundantly transferred between users and endpoints.
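The cost difference can be made concrete with a little arithmetic. This is a minimal sketch that counts only the video bytes; it ignores the rest of the request and any caching, and it assumes an upload-once approach transfers the raw file exactly one time. All function names are illustrative.

```typescript
// Base64 encodes every 3 raw bytes as 4 characters (with padding).
function base64Length(rawBytes: number): number {
  return Math.ceil(rawBytes / 3) * 4
}

// Bytes sent over `turns` messages when the video stays inline in the chat history.
function inlineUploadBytes(rawBytes: number, turns: number): number {
  return base64Length(rawBytes) * turns
}

// Bytes sent when the video is uploaded once and only referenced afterwards (assumption).
function uploadOnceBytes(rawBytes: number): number {
  return rawBytes
}

const tenMegabytes = 10 * 1024 * 1024
console.log(inlineUploadBytes(tenMegabytes, 5)) // 69905080 bytes, roughly 66.7 MiB
console.log(uploadOnceBytes(tenMegabytes)) // 10485760 bytes
```

Under these assumptions, five turns with a 10MB inline video transfer about 6.7 times as much video data as a single upload.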

@VooDisss
Author

VooDisss commented Aug 4, 2025

@nnWhisperer Below I'm showing an attached video, which increased the context by 9.2k tokens, just for your reference...

You raised a good point about why the Files API should be preferred (although the video would then have to be uploaded every time, instead of staying in context for several messages). I think there should instead be a way to remove particular messages or videos from the chat context (but that would be another issue). Also, I just tried uploading a 41-second, 11.4MB video and it failed.

image

@nnWhisperer

The file is uploaded once and then referenced through a handle to it; it isn't uploaded again on every message, which is what provides the speed advantage.

Member

@daniel-lxs daniel-lxs left a comment

Hey @VooDisss, I took another look; there are some points from my previous review that might need addressing.

Overall I think the idea and implementation are solid. Thank you, and sorry for the delay!

Member

The hardcoded list of Gemini model IDs here is quite long and will require manual updates whenever new models are added. Consider using a pattern matching approach instead:

Suggested change
- const isGeminiWithVideo =
+ const isGeminiWithVideo = modelId?.includes("gemini-") &&
+     (modelId.includes("-pro") || modelId.includes("-flash"));

Or better yet, this information could come from the model configuration itself.
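A minimal sketch of the configuration-driven alternative. The supportsVideo flag, the ModelInfo shape shown here, and the acceptedTypesFor helper are all hypothetical; nothing in this thread confirms such a field exists, and the MIME lists are illustrative.

```typescript
// Hypothetical model metadata; "supportsVideo" is an assumed flag for illustration.
interface ModelInfo {
  supportsImages?: boolean
  supportsVideo?: boolean
}

// Illustrative lists only.
const IMAGE_TYPES = ["image/png", "image/jpeg", "image/webp"]
const VIDEO_TYPES = ["video/mp4", "video/quicktime"]

// Derive accepted upload types from capabilities instead of matching on model ID strings.
function acceptedTypesFor(info: ModelInfo | undefined): string[] {
  const accepted: string[] = []
  if (info?.supportsImages) accepted.push(...IMAGE_TYPES)
  if (info?.supportsVideo) accepted.push(...VIDEO_TYPES)
  return accepted
}
```

With this shape, adding a new video-capable model would only require setting the flag in its configuration entry, and no ID list or substring pattern needs to be maintained.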

Member

The constant name MAX_IMAGES_PER_MESSAGE is now misleading since it applies to both images and videos. Consider renaming to MAX_MEDIA_PER_MESSAGE to better reflect its purpose.

Member

The warning message still references images when it should be more generic for media:

Suggested change
- console.warn(t("chat:noValidImages"))
+ console.warn(t("chat:noValidMedia"))

You'll also need to update the corresponding translation key.

Member

Based on the PR discussion, there's a < 20MB limitation for video files when using the HTTP API. Consider adding validation here to provide better error messages:

Suggested change
- if (!block.source.data || block.source.data.trim() === "") {
+ // Check if video data exists
+ if (!block.source.data || block.source.data.trim() === "") {
+     throw new Error("Video data is empty or missing")
+ }
+ // Rough estimate: base64 is ~1.33x the original size
+ const estimatedSize = (block.source.data.length * 0.75) / (1024 * 1024); // in MB
+ if (estimatedSize > 20) {
+     throw new Error(`Video size (~${estimatedSize.toFixed(1)}MB) exceeds the 20MB limit for direct API calls. Consider using the Files API for larger videos.`);
+ }
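Since the discussion above notes that the limit covers the whole request and not only the video, a slightly more conservative check is sketched below. It is an illustration under assumptions: whether the limit counts raw or encoded bytes is not settled in this thread, so the sketch compares the encoded length plus a caller-supplied estimate of the other request bytes. The helper names are hypothetical.

```typescript
const INLINE_REQUEST_LIMIT_BYTES = 20 * 1024 * 1024

// Exact decoded size of padded base64: 3 bytes per 4 characters, minus padding.
function decodedByteLength(base64: string): number {
  const data = base64.replace(/\s/g, "")
  if (data.length === 0) return 0
  const padding = data.endsWith("==") ? 2 : data.endsWith("=") ? 1 : 0
  return Math.floor((data.length * 3) / 4) - padding
}

// Conservative check: base64 is ASCII, so its character count is its size on the wire.
function fitsInlineLimit(base64: string, otherRequestBytes = 0): boolean {
  return base64.length + otherRequestBytes <= INLINE_REQUEST_LIMIT_BYTES
}
```

decodedByteLength is useful for the error message shown to the user, while fitsInlineLimit decides whether to send inline at all.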

@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Changes Requested] in Roo Code Roadmap Aug 4, 2025

@nnWhisperer nnWhisperer left a comment

You can add the video metadata and fps parameter in this pull request; I'm pointing out where to add them in the review comments here. They are supported even when the video is embedded inside the message. For example, here is Python code that uses them:

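# Only for videos of size <20MB
from google import genai
from google.genai import types

# Assumes the API key is available in the environment
client = genai.Client()

video_file_name = "/path/to/your/video.mp4"
with open(video_file_name, "rb") as video_file:
    video_bytes = video_file.read()

response = client.models.generate_content(
    model='models/gemini-2.5-flash',
    contents=types.Content(
        parts=[
            types.Part(
                inline_data=types.Blob(
                    data=video_bytes,
                    mime_type='video/mp4'),
                # Sample 5 frames per second instead of the default
                video_metadata=types.VideoMetadata(fps=5)
            ),
            types.Part(text='Please summarize the video in 3 sentences.')
        ]
    )
)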

Here you can add a metadata parameter after the inline parameter, something similar to this:
video_metadata=types.VideoMetadata(fps=5), where 5 should be configurable
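On the TypeScript side, the equivalent could look roughly like the sketch below. The interfaces are local, assumed shapes that mirror the Python snippet in camelCase, and buildVideoPart is a hypothetical helper; the SDK's own part type should be checked before relying on these field names.

```typescript
// Local, assumed shapes for illustration; not imported from any SDK.
interface VideoMetadata {
  fps?: number
}

interface InlineVideoPart {
  inlineData: { mimeType: string; data: string }
  videoMetadata?: VideoMetadata
}

// Builds an inline video part, attaching a frame-rate hint only when one is configured.
function buildVideoPart(mimeType: string, base64Data: string, fps?: number): InlineVideoPart {
  const part: InlineVideoPart = { inlineData: { mimeType, data: base64Data } }
  if (fps !== undefined) {
    if (!Number.isFinite(fps) || fps <= 0) {
      throw new Error(`fps must be a positive number, got ${fps}`)
    }
    part.videoMetadata = { fps }
  }
  return part
}
```

Leaving videoMetadata off when no fps is configured keeps the default sampling behaviour, so existing requests would be unaffected.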

@VooDisss VooDisss closed this Aug 16, 2025
@github-project-automation github-project-automation bot moved this from PR [Changes Requested] to Done in Roo Code Roadmap Aug 16, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Aug 16, 2025
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Aug 16, 2025
VooDisss added a commit to VooDisss/Roo-Code that referenced this pull request Aug 16, 2025
- Consolidate duplicate getMimeType functions into shared utilities
- Remove duplicate MediaThumbnails component, enhance Thumbnails to support video
- Add JSDoc comments to VideoContentBlock interface
- Convert inline styles to Tailwind classes in ChatRow
- Add robust error handling for video processing
- Create centralized media configuration for accepted file types
- Ensure consistent test naming conventions
- Fix ESLint warnings
VooDisss added a commit to VooDisss/Roo-Code that referenced this pull request Aug 27, 2025
- Consolidate duplicate getMimeType functions into shared utilities
- Remove duplicate MediaThumbnails component, enhance Thumbnails to support video
- Add JSDoc comments to VideoContentBlock interface
- Convert inline styles to Tailwind classes in ChatRow
- Add robust error handling for video processing
- Create centralized media configuration for accepted file types
- Ensure consistent test naming conventions
- Fix ESLint warnings
@isCopyman

isCopyman commented Sep 15, 2025

I noticed that PDF and video processing might be the same. Can we add support for using Gemini with multimodal reading of PDFs? It seems that not too many modifications are needed. #7266


Labels

enhancement New feature or request PR - Changes Requested size:XS This PR changes 0-9 lines, ignoring generated files.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

feat: Enable Video Uploads for Multimodal Analysis

5 participants