Fix Multimodal Gemini Batch Request Creation#690
Fix Multimodal Gemini Batch Request Creation#690alexisdrakopoulos wants to merge 1 commit intobespokelabsai:mainfrom
Conversation
|
@alexisdrakopoulos Thanks for the contribution, please extend the existing tests at |
kartik4949
left a comment
There was a problem hiding this comment.
Looking good, lets do the required changes and add an integration test for this
Thanks
| content = message["content"] | ||
|
|
||
| # Primary Path: Handle the multimodal dictionary from curator | ||
| if isinstance(content, dict) and ("texts" in content or "images" in content): |
There was a problem hiding this comment.
do you need the "texts" or "images" check?
| parts.append({"text": text_item}) | ||
|
|
||
| # Add all image parts from the 'images' list | ||
| for image_item in content.get("images", []): |
There was a problem hiding this comment.
there are "files" as well
| # Fallback to guessing the mime type if not provided | ||
| mime_type, _ = mimetypes.guess_type(url) | ||
| if not mime_type: | ||
| logger.warning(f"Could not determine MIME type for {url}. Defaulting to 'image/png'.") |
There was a problem hiding this comment.
I don't think it's a good idea to only log debug-level messages when the mime_type is missing, as this makes it hard for people to notice the problem—even though other API implementations might not log anything at all. I believe that at the very least a warning should be issued when mime_type is missing, or even an exception should be thrown.
| if generic_request.response_format: | ||
| request_object.update( | ||
| # Ensure generationConfig exists before trying to update it | ||
| if "generationConfig" not in request_object: |
There was a problem hiding this comment.
we are creating request_object with "contents" key only
how will this if statement hit?
What does this PR do?
This pull request adds support for multimodal inputs (specifically text and images) to the
GeminiBatchRequestProcessor. The existing implementation only handled text-based requests, causing errors when acurator.types.Imagewas included in the prompt.This change allows users to leverage Gemini's powerful vision capabilities, such as performing OCR on a large batch of images, through the
curatorlibrary's batch processing interface.Why is this change needed?
Previously, attempting to run a batch job with an image and text resulted in a
400 INVALID_ARGUMENTerror from the Gemini API. The root cause was that the request processor was not correctly formatting the multimodal payload. It tried to serialize the entire content dictionary into a{"text": ...}field, which is invalid.The Gemini Batch API requires multimodal inputs to be structured as a list of
partswithin a singlecontentsobject, where each part is either text ({"text": ...}) or file data ({"fileData": ...}). This PR implements that required structure.How was this change implemented?
The core of the change is within the
create_api_specific_request_batchmethod:message["content"]is the dictionary structure thatcuratorcreates for multimodal prompts (e.g.,{'texts': [...], 'images': [...]}).textsandimageslists within the content dictionary.{"text": "..."}part.{"fileData": {"fileUri": "...", "mimeType": "..."}}part. Themimetypeis automatically detected from the image URI.partsarray, creating a valid request payload for the Gemini Batch API.A fallback for legacy text-only requests is maintained for backward compatibility.
How to test this change?
A reviewer can test this by running a batch job with a
curator.LLMclass that returns both text and an image from itspromptmethod.Example: