Fix MCP tool images not reaching follow-up LLM context #2167

Open
Rishwanth1323 wants to merge 2 commits into huggingface:main from Rishwanth1323:fix/mcp-tool-image-context
Conversation

Rishwanth1323 commented Mar 5, 2026

Title
Fix MCP tool images not reaching follow-up LLM context

Summary
This PR fixes a gap in the MCP tool flow where image outputs (for example, screenshots) were visible in the UI tool panel but were not forwarded to the model in the next reasoning step. As a result, the assistant often responded as if it could not see the image.

With this change, MCP tool-returned images are converted to multimodal image_url parts and injected into the follow-up model turn (when multimodal is enabled), so the assistant can actually analyze the screenshot it just received.

Problem

  • MCP tools can return structured content blocks including images (type: "image" with base64 data).
  • The UI rendered those image blocks correctly in the tool output preview.
  • The server-side follow-up passed only the text from the tool output into role: "tool" messages.
  • If a tool returned image-only output (or an image plus minimal text), the LLM context missed the visual data.
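
A minimal sketch of how this kind of loss can happen, not the repository's actual code. The McpContentBlock type and the flattenToText helper are hypothetical names introduced for illustration; only the block shape (type: "image" with base64 data and a MIME type) follows the description above.

```typescript
// Hypothetical block shape for illustration; the real types live in the MCP SDK
// and in the files touched by this PR.
type McpContentBlock =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType: string };

// Text-only flattening: image blocks are skipped, so nothing visual survives.
function flattenToText(blocks: McpContentBlock[]): string {
  return blocks
    .filter((b): b is Extract<McpContentBlock, { type: "text" }> => b.type === "text")
    .map((b) => b.text)
    .join("\n");
}

// An image-only tool result, such as a screenshot (payload shortened).
const screenshotResult: McpContentBlock[] = [
  { type: "image", data: "iVBORw0KGgo=", mimeType: "image/png" },
];

// The follow-up tool message ends up with empty content.
const toolMessage = { role: "tool" as const, content: flattenToText(screenshotResult) };
console.log(JSON.stringify(toolMessage)); // {"role":"tool","content":""}
```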

Root Cause
The MCP execution pipeline preserved image blocks for UI display but dropped them from the LLM follow-up input path.

What Changed

  1. Added extraction/mapping of MCP image blocks into OpenAI-compatible multimodal parts:
    • type: "image_url" with data:<mime>;base64,<payload>
  2. Extended tool execution summary to include collected image parts (toolImages) in addition to existing toolMessages and toolRuns.
  3. In MCP follow-up loop, when multimodal is enabled, appended a follow-up context message containing:
    • short text instruction
    • all collected tool-returned images as multimodal parts
  4. Added safe fallback for image-only tool outputs:
    • if textual output is empty but images exist, emit a minimal text tool message (Tool returned N image(s).) for compatibility.
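
The mapping and the image-only fallback listed above could look roughly like the sketch below. This is an illustration under assumptions, not the code in this PR: toImageUrlParts and summarizeToolOutput are hypothetical names, and only toolImages, the image_url part shape, and the fallback text come from the description.

```typescript
// Hypothetical block shape for illustration only.
type McpContentBlock =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType: string };

// OpenAI-compatible multimodal image part.
type ImageUrlPart = { type: "image_url"; image_url: { url: string } };

// Convert each MCP image block into a data-URL image part.
function toImageUrlParts(blocks: McpContentBlock[]): ImageUrlPart[] {
  const parts: ImageUrlPart[] = [];
  for (const block of blocks) {
    if (block.type === "image" && block.data) {
      parts.push({
        type: "image_url",
        image_url: { url: `data:${block.mimeType};base64,${block.data}` },
      });
    }
  }
  return parts;
}

// Build the text for the tool message and collect images alongside it.
// If there is no text but there are images, fall back to a short placeholder.
function summarizeToolOutput(blocks: McpContentBlock[]): {
  text: string;
  toolImages: ImageUrlPart[];
} {
  const toolImages = toImageUrlParts(blocks);
  const text = blocks
    .filter((b): b is Extract<McpContentBlock, { type: "text" }> => b.type === "text")
    .map((b) => b.text)
    .join("\n")
    .trim();
  const fallback =
    text === "" && toolImages.length > 0
      ? `Tool returned ${toolImages.length} image(s).`
      : text;
  return { text: fallback, toolImages };
}
```

Keeping the placeholder text means the role: "tool" message is never empty, which some OpenAI-compatible backends may reject.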

Behavior After Fix

  • Tool preview in UI remains unchanged.
  • Follow-up model now receives screenshot/image context and can answer questions about image content.
  • For non-multimodal models, image parts are not injected (existing capability gating remains intact).
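
The capability gating described above might be sketched as follows. The withToolImages helper, the ChatMessage type, the instruction wording, and the choice of a "user" role for the injected message are all assumptions made for illustration; the PR description does not specify them.

```typescript
type TextPart = { type: "text"; text: string };
type ImageUrlPart = { type: "image_url"; image_url: { url: string } };

// Simplified chat message shape (hypothetical, for illustration).
type ChatMessage = {
  role: "system" | "user" | "assistant" | "tool";
  content: string | Array<TextPart | ImageUrlPart>;
};

// Append tool-returned images to the follow-up turn only when the model is multimodal.
function withToolImages(
  messages: ChatMessage[],
  toolImages: ImageUrlPart[],
  multimodal: boolean
): ChatMessage[] {
  // Non-multimodal models, or runs without images, get the messages unchanged.
  if (!multimodal || toolImages.length === 0) return messages;
  return [
    ...messages,
    {
      role: "user", // assumed role; tool messages are typically text-only
      content: [
        { type: "text", text: "Images returned by the tool call above:" },
        ...toolImages,
      ],
    },
  ];
}
```

Returning the original array untouched in the gated case keeps the non-multimodal path identical to the pre-fix behavior.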

Files Changed

  • src/lib/server/textGeneration/mcp/toolInvocation.ts
  • src/lib/server/textGeneration/mcp/runMcpFlow.ts

Validation

  • npm run check passes (no TypeScript/Svelte errors).
  • Verified manually with screenshot MCP tool flow:
    • tool call succeeds
    • image preview is shown
    • assistant follow-up now uses screenshot context for analysis.

Notes

  • No new tests were added in this PR (this MCP path currently has limited direct test coverage).

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f4e31100f
