
design: Add 0004-multimodal-i2t proposal #674

Open
sangminwoo wants to merge 3 commits into strands-agents:main from sangminwoo:main

Conversation

sangminwoo commented Mar 17, 2026

Description

Add design doc for multimodal image-to-text evaluation support in strands-evals SDK.

Introduces MultimodalOutputEvaluator extending OutputEvaluator to enable MLLM-as-a-Judge evaluation for image/document-to-text tasks. The evaluator constructs multimodal prompts using strands SDK ContentBlock format and supports both reference-based and reference-free evaluation across four dimensions: Overall Quality (P0), Correctness (P0), Faithfulness (P1), and Instruction Following (P1).

Key design decisions:

  • Extends OutputEvaluator to reuse rubric/model/system_prompt management
  • Built-in rubric templates + convenience subclasses per dimension
  • InputT=dict carries {"image": ImageData, "instruction": str}
  • ImageData supports file paths, base64, data URLs, bytes, PIL Images with JSON-safe serialization

Related Issues

Type of Change

  • New content

Checklist

  • I have read the CONTRIBUTING document
  • My changes follow the project's documentation style
  • I have tested the documentation locally using npm run dev
  • Links in the documentation are valid and working

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

sangminwoo marked this pull request as draft March 18, 2026 00:08
sangminwoo marked this pull request as ready for review March 18, 2026 00:08

* `InputT=dict` is less type-safe than a dataclass (`MultimodalInput` TypedDict provides partial typing)
* Multimodal judge calls are more expensive/slower than text-only (image tokens cost more)
* Remote image sources (S3, HTTP URLs) require user to download before evaluation — no built-in fetching to avoid heavy dependencies (boto3, requests)
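The first trade-off above (`InputT=dict` vs. a dataclass) can be narrowed with the `MultimodalInput` TypedDict the proposal mentions. A minimal sketch, with a stand-in `ImageData` class since the real wrapper lives in the SDK:

```python
from typing import TypedDict


class ImageData:
    """Stand-in for the SDK's ImageData wrapper (hypothetical shape)."""

    def __init__(self, source: str) -> None:
        self.source = source


class MultimodalInput(TypedDict):
    """Partial static typing for the evaluator's InputT=dict payload."""

    image: ImageData
    instruction: str


# Type checkers will flag missing or misspelled keys, while the value
# remains a plain dict at runtime.
case_input: MultimodalInput = {
    "image": ImageData("chart.png"),
    "instruction": "What is the revenue trend?",
}
```

This keeps the runtime representation JSON-friendly while recovering most of the type safety a dataclass would provide.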
Contributor

I think we can probably support remote images in this:

# Define cases with image data in input dict
cases = [Case[dict, str](
    input={"image": ImageData(source="chart.png"), "instruction": "What is the revenue trend?"},
)]

Author

Good point. We can support HTTP URLs using urllib.request (stdlib), so no new dependency is needed. For S3 URIs, we can make boto3 an optional dependency:

  • HTTP/HTTPS: auto fetched via urllib.request
  • S3: auto fetched if boto3 is installed, error message otherwise
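The two bullets above could be sketched roughly like this; the function name and error wording are placeholders, not the final API:

```python
import urllib.request
from urllib.parse import urlparse


def fetch_image_bytes(source: str) -> bytes:
    """Resolve a remote image source to raw bytes.

    HTTP/HTTPS is fetched via the stdlib urllib.request; s3:// URIs
    require boto3, treated here as an optional dependency.
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        with urllib.request.urlopen(source) as resp:
            return resp.read()
    if parsed.scheme == "s3":
        try:
            import boto3  # optional dependency, imported lazily
        except ImportError as exc:
            raise ImportError(
                "Reading s3:// image sources requires boto3; "
                "install it or download the object first."
            ) from exc
        obj = boto3.client("s3").get_object(
            Bucket=parsed.netloc, Key=parsed.path.lstrip("/")
        )
        return obj["Body"].read()
    raise ValueError(f"Unsupported image source scheme: {parsed.scheme!r}")
```

Lazy-importing boto3 keeps the base install light while still giving a clear error when an S3 URI is passed without it.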

Does this make sense to you?

