MCP tools can return image content blocks, but these were only passed to the UI for display and never forwarded to the LLM in the follow-up turn. Since OpenAI's `role: "tool"` messages only accept text, inject a separate `role: "user"` message containing the image parts when the model supports multimodal input.
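The fix described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the type names (`McpImageContent`, `OpenAiImagePart`) and helper names here are assumptions, not the identifiers used in the repository.

```typescript
// Hypothetical shapes; the real MCP and OpenAI SDK types may differ.
interface McpImageContent {
  type: "image";
  data: string;     // base64-encoded payload
  mimeType: string; // e.g. "image/png"
}

interface OpenAiImagePart {
  type: "image_url";
  image_url: { url: string; detail: "auto" | "low" | "high" };
}

// Convert one MCP image block into an OpenAI-compatible image_url part
// by embedding the payload as a data URL.
function toImagePart(block: McpImageContent): OpenAiImagePart {
  return {
    type: "image_url",
    image_url: { url: `data:${block.mimeType};base64,${block.data}`, detail: "auto" },
  };
}

// role: "tool" messages are text-only, so the images ride in a
// separate role: "user" message appended after the tool results.
function buildImageUserMessage(blocks: McpImageContent[]) {
  return {
    role: "user" as const,
    content: blocks.map(toImagePart),
  };
}
```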
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a6780d301
```ts
return undefined;
return {
  type: "image_url",
  image_url: { url: `data:${obj.mimeType};base64,${obj.data}`, detail: "auto" },
```
Bound MCP image payloads before adding data URLs
This constructs `image_url` parts from raw MCP image blocks without any size or MIME normalization, so a tool that returns a large or unsupported image (for example, a full-resolution screenshot) will be forwarded verbatim and can cause the follow-up `chat.completions.create` call to fail on payload/image validation. In the same flow, user-uploaded images are constrained via `makeImageProcessor` (`maxSizeInMB`, width/height), so tool-returned images need equivalent checks or preprocessing before being appended to `toolImages`.
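A minimal sketch of the bounding this comment asks for, assuming an illustrative MIME allowlist and size cap; the exact limits, and whether to reuse `makeImageProcessor` instead, are up to the PR author:

```typescript
// Assumed limits for illustration only; pick values matching the
// target API's documented constraints.
const ALLOWED_MIME = new Set(["image/png", "image/jpeg", "image/gif", "image/webp"]);
const MAX_BYTES = 20 * 1024 * 1024;

// Base64 inflates data by 4/3, so the decoded size can be estimated
// from the string length minus padding, without decoding the payload.
function decodedSize(base64: string): number {
  const padding = base64.endsWith("==") ? 2 : base64.endsWith("=") ? 1 : 0;
  return (base64.length * 3) / 4 - padding;
}

// Gate a tool-returned image before building a data URL from it.
function isForwardable(mimeType: string, data: string): boolean {
  return ALLOWED_MIME.has(mimeType) && decodedSize(data) <= MAX_BYTES;
}
```

Checking the base64 length avoids decoding multi-megabyte payloads just to reject them; oversized or unsupported images can then be dropped (or downscaled) instead of failing the whole follow-up completion call.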
Summary
- MCP tools can return image content blocks (`{ type: "image", data, mimeType }`), but these were only displayed in the UI and never forwarded to the LLM in the follow-up turn
- OpenAI `role: "tool"` messages only accept `string | Array<TextPart>`, so a separate `role: "user"` message is the only way to inject images into the LLM context
- New `toToolImagePart` helper that converts MCP `ImageContent` blocks into OpenAI-compatible `image_url` parts, and injects them as a user message when the model supports multimodal input

Changes
- `toolInvocation.ts`: new `ToolImagePart` type, `toToolImagePart()` converter, image extraction in the collation loop, placeholder text when output is empty but images exist
- `runMcpFlow.ts`: when `mmEnabled` and tool images are present, appends a `role: "user"` message with image parts after tool results; adds `toolImageCount` to logger

Test plan
- `npm run check`: no type errors
- `npm run lint`: passes
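The `runMcpFlow.ts` change described under Changes can be sketched as below. This is a simplified sketch under assumed message shapes; apart from `mmEnabled` and `toolImages`, which the PR mentions, the names here are hypothetical:

```typescript
// Simplified message union for illustration.
type ImagePart = {
  type: "image_url";
  image_url: { url: string; detail: "auto" };
};
type Message =
  | { role: "tool"; tool_call_id: string; content: string }
  | { role: "user"; content: string | ImagePart[] };

// After the tool-result messages, append a user message carrying the
// tool images, but only when the model supports multimodal input;
// text-only endpoints reject image parts outright.
function appendToolImages(
  messages: Message[],
  toolImages: ImagePart[],
  mmEnabled: boolean,
): Message[] {
  if (!mmEnabled || toolImages.length === 0) return messages;
  return [...messages, { role: "user", content: toolImages }];
}
```

Gating on `mmEnabled` keeps the follow-up turn valid for text-only models: they still see the tool's text output, and only multimodal models receive the extra image-bearing user message.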