Add screenshot capture for multimodal input#220

Open
aj47 wants to merge 15 commits into main from
feature/screenshot-multimodal-input

Conversation

@aj47
Owner

@aj47 aj47 commented Oct 31, 2025

Summary

This PR implements screenshot capture functionality for multimodal input, allowing users to include screenshots with their text messages when using vision-capable models.

Changes

UI Components

  • Added screenshot checkbox to text input panel with Camera icon
  • Added visual feedback for screenshot capture state (capturing, captured)
  • Screenshot is automatically captured when checkbox is toggled on
  • Screenshot is cleared after message submission

Backend Implementation

  • Implemented screenshot capture using Electron's desktopCapturer API
  • Exposed desktopCapturer API in preload script for renderer access
  • Added screenshot parameter throughout the processing pipeline:
    • createTextInput and createMcpTextInput TIPC procedures
    • processWithAgentMode function
    • processTranscriptWithAgentMode function

Multimodal Message Format

  • Updated message format to support both string and array content
  • Implemented OpenAI-compatible vision API format:
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What's in this image?" },
        { "type": "image_url", "image_url": { "url": "data:image/png;base64,..." } }
      ]
    }
  • Updated LLM API calls to handle both string and array content formats
  • Added screenshot field to conversation history entries
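
A minimal sketch of how a text prompt and an optional screenshot data URL could be combined into the format above. The names `ContentPart`, `ChatMessage`, and `buildUserMessage` are illustrative assumptions, not identifiers taken from this PR.

```typescript
// Hypothetical types mirroring the OpenAI-compatible vision format above.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }

interface ChatMessage {
  role: "system" | "user" | "assistant"
  content: string | ContentPart[]
}

// Build a user message: plain string content when there is no screenshot,
// array content (text part + image part) when a screenshot is attached.
function buildUserMessage(text: string, screenshot?: string): ChatMessage {
  if (!screenshot) {
    return { role: "user", content: text }
  }
  return {
    role: "user",
    content: [
      { type: "text", text },
      { type: "image_url", image_url: { url: screenshot } },
    ],
  }
}
```

Keeping the string form for text-only messages means existing callers that expect string content are unaffected unless a screenshot is actually present.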

Type Safety

  • Updated TypeScript types to support multimodal content
  • Updated message type signatures in llm-fetch.ts and llm.ts
  • All type checks pass successfully

Testing

  • ✅ TypeScript compilation passes
  • ✅ All existing tests pass
  • Manual testing recommended with vision-capable models (GPT-4V, Claude with vision, etc.)

Fixes

Closes #217

Notes

  • Screenshot capture uses the primary display by default
  • Screenshots are captured at 1920x1080 resolution
  • Base64-encoded screenshots are included directly in the API request
  • This feature works with any OpenAI-compatible vision model
  • Non-vision models will ignore the image data gracefully

Summary by CodeRabbit

  • New Features

    • Optional screenshot capture in the text input UI so users can submit screen images with text
    • Multimodal conversation support: LLM interactions now accept and render combined text and image content
    • Message content can include structured (non-string) data to support richer outputs
  • Chores

    • Renderer-preload now exposes a desktop capture API for obtaining screen sources from the environment

- Add screenshot checkbox to text input panel with Camera icon
- Implement screenshot capture using Electron's desktopCapturer API
- Update message format to support multimodal content (text + image)
- Add screenshot parameter throughout the processing pipeline
- Update LLM API calls to handle both string and array content formats
- Support OpenAI-compatible vision API format with base64 images
- Expose desktopCapturer API in preload script for renderer access

Fixes #217
@coderabbitai

coderabbitai bot commented Oct 31, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The changes add screenshot capture and multimodal support end-to-end: renderer UI captures screenshots, preload exposes desktopCapturer, IPC threads screenshot through tipc, LLM layers accept structured (text+image) message content and update token estimation and prompt construction for non-string content.

Changes

Cohort: Renderer UI & Input Pages
File(s): src/renderer/src/components/text-input-panel.tsx, src/renderer/src/pages/panel.tsx
Summary: Add UI toggle and capture state for screenshots, auto-capture behavior, Camera icon and status; update onSubmit and internal submit handlers to accept and forward optional screenshot alongside text; propagate screenshot through mutation functions and submit flow.

Cohort: IPC / Agent Router
File(s): src/main/tipc.ts
Summary: Add optional screenshot?: string parameter to processWithAgentMode, update .input() signatures for createTextInput and createMcpTextInput, and thread screenshot through fallback and MCP-enabled paths to ensure downstream processing receives it.

Cohort: LLM Multimodal Processing
File(s): src/main/llm.ts
Summary: Add optional screenshot parameter to processTranscriptWithAgentMode, propagate screenshot in conversation entries, and transform entries with screenshot into multimodal content parts (text + image_url) where prompts/progress updates are built.

Cohort: LLM Fetch / Structured Content Support
File(s): src/main/llm-fetch.ts
Summary: Broaden message.content type to `string

Cohort: Preload / Desktop Capture Exposure
File(s): src/preload/index.d.ts, src/preload/index.ts
Summary: Extend window.electron typing to include desktopCapturer.getSources(...) and expose api.desktopCapturer.getSources that delegates to electron.desktopCapturer.getSources.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant UI as TextInputPanel
    participant Panel as panel.tsx
    participant Preload as preload
    participant Main as main/tipc
    participant Agent as Agent Processing
    participant LLM as LLM API

    User->>UI: Enable "Include screenshot" + enter text
    UI->>Preload: api.desktopCapturer.getSources()
    Preload-->>UI: Screenshot DataURL
    User->>UI: Submit (text + screenshot)
    UI->>Panel: onSubmit(text, screenshot)
    Panel->>Main: createTextInput/createMcpTextInput({text, screenshot})
    Main->>Agent: processWithAgentMode(text, conversationId, screenshot)
    Agent->>Agent: Transform history -> [{type:'text', text}, {type:'image_url', image_url}]
    Agent->>LLM: Call LLM with multimodal message.content (array)
    LLM-->>Agent: Response
    Agent-->>Panel: Result
    Panel-->>UI: Clear input & screenshot

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Extra attention areas:
    • src/renderer/src/components/text-input-panel.tsx — verify useEffect and capture lifecycle, UI/UX edge cases.
    • src/main/llm.ts — ensure all prompt/progress mapping paths correctly handle array-formatted content.
    • src/main/llm-fetch.ts — confirm token estimation and Gemini prompt stringification match intended API payload semantics.
    • src/main/tipc.ts — confirm screenshot is threaded and optional/undefined handling is consistent.
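
For the token-estimation point, one possible approach is sketched below. It keeps the characters-divided-by-four heuristic from the existing code for text and adds a flat per-image budget rather than counting base64 characters. The function name `estimateTokens` and the `IMAGE_TOKEN_BUDGET` value of 1000 are placeholder assumptions, not figures from this PR or from any provider.

```typescript
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }

type MessageContent = string | ContentPart[]

// Placeholder assumption: a flat budget per image. Counting the base64
// characters of a data URL would inflate the estimate considerably.
const IMAGE_TOKEN_BUDGET = 1000

function estimateTokens(messages: { content: MessageContent }[]): number {
  let chars = 0
  let images = 0
  for (const msg of messages) {
    if (typeof msg.content === "string") {
      chars += msg.content.length
      continue
    }
    for (const part of msg.content) {
      if (part.type === "text") chars += part.text.length
      else images += 1
    }
  }
  // Same chars / 4 heuristic as the string-only estimate, plus image budget.
  return Math.ceil(chars / 4) + images * IMAGE_TOKEN_BUDGET
}
```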

Poem

🐰 I snapped a scene with nimble paws,
Sent pixels through the app’s soft laws,
Text and image hop in one array,
LLM listens, brightens up the day,
Rabbit cheers — a multimodal cause! 📸✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check: Linked Issues Check
Status: ⚠️ Warning
Explanation: The PR partially implements the requirements from issue #217 but leaves a significant gap. UI and multimodal integration requirements are met: screenshot checkbox with visual feedback added to TextInputPanel, system capture integrated via Electron's desktopCapturer, and LLM integration updated to handle multimodal content in OpenAI-compatible format. However, the PR does not implement agent settings configuration to enable/disable screenshot behavior or configure quality/format preferences, which are explicitly listed as acceptance criteria and primary objectives in the linked issue. The raw_summary shows no changes to any agent settings or configuration UI components.
Resolution: Consider either including agent settings configuration in this PR (checkbox to enable/disable screenshot per agent, quality/format options) or clarifying whether agent settings are intentionally deferred to a follow-up PR. If deferred, the linked issue's acceptance criteria should be updated to reflect that this PR only addresses UI screenshot capture and multimodal transmission, with agent-level configuration planned separately.

Check: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.
Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (3 passed)

Check: Description Check
Status: ✅ Passed
Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.

Check: Title Check
Status: ✅ Passed
Explanation: The PR title "Add screenshot capture for multimodal input" clearly and concisely summarizes the primary changes in the changeset. The implementation adds screenshot capture UI components, integrates Electron's desktopCapturer API, propagates screenshot data through the processing pipeline, and updates LLM integration to support multimodal content. The title directly reflects these core additions without being vague or misleading.

Check: Out of Scope Changes Check
Status: ✅ Passed
Explanation: All code changes in the PR are directly aligned with screenshot capture and multimodal input objectives. The modifications to llm-fetch.ts and llm.ts support multimodal content types, changes to tipc.ts propagate screenshot parameters through the processing pipeline, preload scripts expose the desktopCapturer API, and renderer components implement the UI and capture logic. No extraneous, unrelated, or tangential changes are present outside the scope of enabling screenshot capture for vision-capable models.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines +64 to +69
const captureScreenshot = async () => {
  setIsCapturingScreenshot(true)
  try {
    // Use Electron's desktopCapturer API to capture screenshot
    const sources = await (window as any).electron.desktopCapturer.getSources({
      types: ['screen'],

[P1] Wire screenshot API to the wrong global

The screenshot checkbox calls window.electron.desktopCapturer.getSources(...), but the preload script only exposes desktopCapturer on the custom electronAPI object (contextBridge.exposeInMainWorld("electronAPI", api)). window.electron still points to the default Electron toolkit API and has no desktopCapturer, so the call will throw and screenshots are never captured. Expose the capturer on window.electron or update the renderer to use window.electronAPI.desktopCapturer.

Comment on lines 85 to 88
 const handleSubmit = () => {
   if (text.trim() && !isProcessing) {
-    onSubmit(text.trim())
+    onSubmit(text.trim(), screenshot || undefined)
     setText("")

[P1] Send screenshots even after user unchecks option

handleSubmit always forwards screenshot || undefined to onSubmit and then clears the state afterwards. The checkbox only toggles includeScreenshot; it never clears screenshot when the user turns the option off. If a screenshot was captured and the user unchecks the box before sending, the image is still submitted despite the UI indicating otherwise, which is a privacy risk. Either gate the submission on includeScreenshot or reset screenshot when the checkbox is unchecked.
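
The first remedy suggested here, gating the submission on the checkbox state, could look roughly like the sketch below. The helper name `screenshotForSubmit` is hypothetical; the PR may implement the gate inline in `handleSubmit`.

```typescript
// Hypothetical helper: decide what to pass as the second argument of onSubmit.
// The screenshot is forwarded only when the checkbox is on AND an image exists.
function screenshotForSubmit(
  includeScreenshot: boolean,
  screenshot: string | null,
): string | undefined {
  return includeScreenshot && screenshot ? screenshot : undefined
}
```

With this in place, the submit call would become something like `onSubmit(text.trim(), screenshotForSubmit(includeScreenshot, screenshot))`, so a stale capture held in state is never sent while the toggle is off.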

@aj47
Owner Author

aj47 commented Oct 31, 2025

augment review

@augmentcode augmentcode bot left a comment

Review completed. 4 suggestions posted.

@@ -10,6 +10,10 @@ const api = {
getOAuthStatus: (serverName: string) => ipcRenderer.invoke('getOAuthStatus', serverName),
revokeOAuthTokens: (serverName: string) => ipcRenderer.invoke('revokeOAuthTokens', serverName),
testMCPServer: (serverName: string, config: any) => ipcRenderer.invoke('testMCPServer', { serverName, config }),
// Screenshot API
desktopCapturer: {

The desktopCapturer API is exposed under window.electronAPI, but the renderer uses window.electron.desktopCapturer; this mismatch will make screenshot capture undefined at runtime. Consider aligning the exposure with the usage (related to the usage in text-input-panel.tsx).

setIsCapturingScreenshot(true)
try {
// Use Electron's desktopCapturer API to capture screenshot
const sources = await (window as any).electron.desktopCapturer.getSources({

This references window.electron.desktopCapturer, but preload exposes desktopCapturer on window.electronAPI, so this will be undefined at runtime. Related to the exposure in src/preload/index.ts.

@@ -570,7 +570,10 @@ async function makeOpenAICompatibleCall(
   }

   const model = getModel(providerId, "mcp")
-  const estimatedTokens = Math.ceil(messages.reduce((sum, msg) => sum + msg.content.length, 0) / 4)
+  const estimatedTokens = Math.ceil(messages.reduce((sum, msg) => {

With multimodal messages (array content), the request logging in makeAPICallAttempt that truncates message content via substring assumes string content and will throw in debug mode. Consider making the debug logging robust to array content to avoid runtime errors when isDebugLLM() is true.
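
One way to make such logging tolerant of array content is sketched below. The helper name `previewContent`, the `[image]` marker, and the 200-character default are assumptions for illustration; the actual logging code in `makeAPICallAttempt` is not shown in this thread.

```typescript
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }

// Produce a short, log-safe preview for either string or array content.
// Image parts are reduced to a marker so data URLs never reach the log.
function previewContent(content: string | ContentPart[], maxLen = 200): string {
  const text =
    typeof content === "string"
      ? content
      : content
          .map((part) => (part.type === "text" ? part.text : "[image]"))
          .join(" ")
  return text.length > maxLen ? text.substring(0, maxLen) + "..." : text
}
```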

})

if (sources && sources.length > 0) {
// Get the first screen (primary display)

Selecting sources[0] as the "primary display" isn’t guaranteed by desktopCapturer; source ordering can vary across platforms. This can capture the wrong screen on multi‑monitor setups (also applies to other locations if replicated).
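
A possible way to avoid relying on ordering is to match each source's `display_id` against the primary display's id, which the main process can obtain from `screen.getPrimaryDisplay().id`. The sketch below is a pure helper under that assumption; the name `pickPrimarySource` and the reduced `ScreenSource` shape are illustrative, and the fallback to the first source is a design choice, not something this PR specifies.

```typescript
// Minimal shape of the fields used here. Electron's DesktopCapturerSource
// also carries name, thumbnail, and other fields.
interface ScreenSource {
  id: string
  display_id: string
}

// Pick the source whose display_id matches the primary display id.
// display_id can be an empty string when unavailable, so fall back to the
// first source rather than returning nothing.
function pickPrimarySource<T extends ScreenSource>(
  sources: T[],
  primaryDisplayId: number,
): T | undefined {
  const wanted = String(primaryDisplayId)
  return sources.find((s) => s.display_id === wanted) ?? sources[0]
}
```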

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/main/llm-fetch.ts (1)

827-835: Handle multimodal responses without .trim() runtime errors

Upstream changes now allow messageObj.content to be an array (e.g., { type: "text" }, { type: "image_url" }). This line still assumes a string and calls .trim(), so the first multimodal assistant reply will crash with TypeError: rawContent.trim is not a function, aborting the whole agent run. Normalize array content before trimming.

-  let content: string | undefined = (messageObj.content ?? "").trim()
+  const rawContent = messageObj.content
+  let content: string | undefined
+
+  if (Array.isArray(rawContent)) {
+    content = rawContent
+      .map((part: any) => {
+        if (typeof part?.text === "string") return part.text
+        if (typeof part?.image_url?.url === "string") return `[image:${part.image_url.url}]`
+        return ""
+      })
+      .filter(Boolean)
+      .join("\n")
+      .trim()
+  } else {
+    content = (rawContent ?? "").trim()
+  }
🧹 Nitpick comments (2)
src/renderer/src/pages/panel.tsx (2)

135-156: Consider adding client-side validation for screenshot data.

The mutation accepts and forwards the screenshot to the backend without validation. Adding client-side checks would improve UX by catching issues earlier and providing clearer error messages.

Consider adding validation before the backend call:

 const textInputMutation = useMutation({
   mutationFn: async ({ text, screenshot }: { text: string; screenshot?: string }) => {
+    // Validate screenshot if provided
+    if (screenshot) {
+      if (!screenshot.startsWith('data:image/')) {
+        throw new Error('Invalid screenshot format. Please try capturing again.')
+      }
+      const sizeInMB = (screenshot.length * 0.75) / (1024 * 1024)
+      if (sizeInMB > 10) {
+        throw new Error(`Screenshot is too large (${sizeInMB.toFixed(1)}MB). Maximum size is 10MB.`)
+      }
+    }
     await tipcClient.createTextInput({ text, screenshot })
   },

158-193: Apply the same validation to MCP text input mutation.

Similar to textInputMutation, this mutation should validate the screenshot parameter before sending to the backend to catch errors early and provide better UX.

Apply the same validation pattern suggested for textInputMutation at lines 135-156.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 881512f and 9ad19c9.

📒 Files selected for processing (7)
  • src/main/llm-fetch.ts (6 hunks)
  • src/main/llm.ts (5 hunks)
  • src/main/tipc.ts (5 hunks)
  • src/preload/index.d.ts (1 hunks)
  • src/preload/index.ts (2 hunks)
  • src/renderer/src/components/text-input-panel.tsx (4 hunks)
  • src/renderer/src/pages/panel.tsx (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/main/tipc.ts (1)
src/main/config.ts (1)
  • configStore (142-142)
src/renderer/src/pages/panel.tsx (1)
src/renderer/src/lib/tipc-client.ts (1)
  • tipcClient (12-14)
🔇 Additional comments (3)
src/main/tipc.ts (2)

619-688: Screenshot parameter properly threaded through agent mode flow.

The implementation correctly forwards the screenshot parameter through the MCP text input flow to processWithAgentMode. The screenshot will be included in the conversation history and multimodal LLM prompts as described in the PR objectives.

Note: Ensure screenshot validation is added as suggested in the earlier comment on processWithAgentMode.


566-617: Screenshot parameter accepted but unused in createTextInput function.

The createTextInput function accepts a screenshot parameter (line 569) but never uses it anywhere in the function body. The screenshot is passed to this function when mcpToolsEnabled is false (line 630), but it's silently discarded.

Screenshots are only processed in the MCP flow via processWithAgentMode, where they're sent to LLMs as multimodal content. The fallback text-input path doesn't persist or process screenshots despite accepting them in the input type definition.

Recommendation: Either remove screenshot from the createTextInput input type since it's not used, or document this limitation. Alternatively, if screenshots should be stored for non-MCP mode, add a screenshot field to RecordingHistoryItem and persist them.

src/renderer/src/pages/panel.tsx (1)

352-375: Screenshot parameter properly propagated through submission flow.

The implementation correctly forwards the screenshot parameter to both MCP and standard text input mutations, ensuring it reaches the backend processing layer. The conversation message creation appropriately includes only the text, keeping the screenshot as context for the LLM rather than display content.

src/main/tipc.ts Outdated
Comment on lines 49 to 53
async function processWithAgentMode(
text: string,
conversationId?: string,
screenshot?: string,
): Promise<string> {

⚠️ Potential issue | 🟠 Major

Add validation for screenshot parameter.

The screenshot parameter accepts base64-encoded image data that can be 2-3MB for the configured 1920x1080 resolution. Without validation, this could lead to memory issues or API payload size limit violations downstream.

Consider adding validation before processing:

 async function processWithAgentMode(
   text: string,
   conversationId?: string,
   screenshot?: string,
 ): Promise<string> {
+  // Validate screenshot if provided
+  if (screenshot) {
+    // Check format (should be data URL)
+    if (!screenshot.startsWith('data:image/')) {
+      throw new Error('Invalid screenshot format')
+    }
+    // Check size (e.g., 10MB limit)
+    const sizeInMB = (screenshot.length * 0.75) / (1024 * 1024) // approximate base64 size
+    if (sizeInMB > 10) {
+      throw new Error(`Screenshot too large: ${sizeInMB.toFixed(1)}MB (max 10MB)`)
+    }
+  }
+
   const config = configStore.get()

Comment on lines +13 to +16
// Screenshot API
desktopCapturer: {
getSources: (options: any) => desktopCapturer.getSources(options)
}

⚠️ Potential issue | 🔴 Critical

Expose desktopCapturer on the object the renderer actually uses

contextBridge.exposeInMainWorld("electron", electronAPI) still publishes the unmodified toolkit API, so the newly added screenshot helper only exists on window.electronAPI.desktopCapturer. The renderer code calls window.electron.desktopCapturer.getSources(...), which will be undefined at runtime—screenshot capture never fires and throws. Please merge the desktopCapturer proxy into the object you expose as window.electron so the preload and renderer stay in sync.

 const api = {
@@
-  desktopCapturer: {
-    getSources: (options: any) => desktopCapturer.getSources(options)
-  }
+  desktopCapturer: {
+    getSources: (options: any) => desktopCapturer.getSources(options)
+  }
 }
 
+const electronWithDesktopCapturer = {
+  ...electronAPI,
+  desktopCapturer: api.desktopCapturer,
+}
+
 if (process.contextIsolated) {
   try {
-    contextBridge.exposeInMainWorld("electron", electronAPI)
+    contextBridge.exposeInMainWorld("electron", electronWithDesktopCapturer)
     contextBridge.exposeInMainWorld("electronAPI", api)
@@
-  window.electron = electronAPI
+  window.electron = electronWithDesktopCapturer
   window.electronAPI = api
 }

Committable suggestion skipped: line range outside the PR's diff.

- Only send screenshot when includeScreenshot is true
- Clear screenshot data when user unchecks the box
- Prevents unintended screenshot transmission
- Respects user privacy expectations

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
src/renderer/src/components/text-input-panel.tsx (3)

68-68: Still using wrong API path - critical runtime error.

Past reviews flagged that window.electron.desktopCapturer is undefined; the preload script exposes it on window.electronAPI.desktopCapturer. This will throw at runtime and prevent screenshot capture.

Apply this diff:

-      const sources = await (window as any).electron.desktopCapturer.getSources({
+      const sources = await (window as any).electronAPI.desktopCapturer.getSources({

74-75: Primary display assumption unreliable on multi-monitor setups.

desktopCapturer doesn't guarantee source ordering, so sources[0] may capture a secondary screen. Consider filtering for the primary display or documenting this limitation.


96-103: Missing dependency in useEffect.

The effect references screenshot on lines 97 and 99 but doesn't include it in the dependency array. React's exhaustive-deps rule flags this. Add screenshot to the dependencies or use a callback to avoid potential stale closure issues.

-  }, [includeScreenshot])
+  }, [includeScreenshot, screenshot])

Note: You may also need to add isCapturingScreenshot as a guard to prevent re-capture loops.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad19c9 and dfe0269.

📒 Files selected for processing (1)
  • src/renderer/src/components/text-input-panel.tsx (4 hunks)
🔇 Additional comments (3)
src/renderer/src/components/text-input-panel.tsx (3)

85-93: Privacy concern properly addressed.

The submission now correctly respects the includeScreenshot toggle and only sends the screenshot when both the checkbox is checked AND a screenshot exists. This resolves the privacy issue flagged in previous reviews.


96-103: Screenshot clearing properly implemented.

The effect correctly clears the screenshot when the user unchecks the box, addressing the privacy concern from previous reviews.


193-212: UI implementation looks good.

The screenshot checkbox, status indicators, and disabled states are properly implemented with clear visual feedback.

Comment on lines +78 to +82
} catch (error) {
console.error('Failed to capture screenshot:', error)
} finally {
setIsCapturingScreenshot(false)
}

⚠️ Potential issue | 🟡 Minor

Provide user feedback when screenshot capture fails.

Errors are logged but users see no indication that capture failed. The checkbox stays checked and shows "Capturing..." indefinitely if an error occurs, creating a confusing state.

Consider showing a brief error message or toast notification when capture fails.

🤖 Prompt for AI Agents
In src/renderer/src/components/text-input-panel.tsx around lines 78 to 82, the
catch block only logs the error so the UI stays in a "Capturing..." state and
the checkbox remains checked; update the catch to set the capturing state back
to false (or ensure finally still runs) and surface a brief user-facing error
(e.g., trigger an existing toast/notification or set a local error message state
that displays a short inline message) so users know the capture failed; keep the
console.error for debugging but call the toast/notification helper (or setError)
with a friendly message like "Screenshot failed" so the checkbox and label are
restored and the user receives feedback.

@aj47
Owner Author

aj47 commented Nov 1, 2025

augment review

@augmentcode augmentcode bot left a comment

Review completed. No suggestions at this time.

Resolves merge conflicts from monorepo restructure:
- Fixed content conflict in llm-fetch.ts (combined multimodal content type with sessionId/onRetryProgress)
- Applied screenshot/multimodal changes to new file locations:
  - apps/desktop/src/main/llm.ts: Added screenshot parameter and multimodal message handling
  - apps/desktop/src/main/tipc.ts: Added screenshot parameter to processWithAgentMode and router
  - apps/desktop/src/renderer/src/pages/panel.tsx: Added screenshot to mutations and handleTextSubmit
- Removed old src/ files that were deleted in main but modified in PR

Fixes #671
@aj47 aj47 added the slot-1 label Dec 20, 2025
@aj47
Owner Author

aj47 commented Dec 20, 2025

augment review

@augmentcode

augmentcode bot commented Dec 20, 2025

🤖 Augment PR Summary

Summary: Adds optional screenshot capture so users can send multimodal (text + image) prompts to vision-capable models.

Changes:

  • Renderer: adds an “Include screenshot” toggle to the text input panel, capture state feedback, and a preview dialog.
  • Settings: introduces screenshot preferences (always include for text input, auto-capture for voice commands, and display selection for multi-monitor setups).
  • Main/preload: exposes screen capture via IPC (main-process desktopCapturer handler + preload bridge) and adds a helper to capture screenshots from main.
  • TIPC/agent pipeline: threads an optional screenshot data URL through text/MCP entrypoints into agent-mode processing.
  • LLM layer: supports OpenAI-style multimodal content arrays and updates context/logging utilities to tolerate non-string message content.

Technical Notes: Screenshots are sent as data:image/… URLs (default 1920×1080) with basic size/format validation, and macOS permission failures are surfaced with actionable errors.

@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

// Clear screenshot when user unchecks the box
setScreenshot(null)
}
}, [includeScreenshot])

Because this effect only depends on includeScreenshot, an in-flight captureScreenshot() can still call setScreenshot(...) after the user unchecks the box, leaving a “captured” image in state while the toggle is off. Consider making the capture lifecycle resilient to that race so the UI state can’t re-populate a screenshot after opt-out.
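
One common pattern for this kind of race is to tag each capture with a token and discard results whose token is no longer current. The sketch below is a framework-free illustration; the class name `CaptureGuard` and its methods are hypothetical and not the mechanism used in this PR.

```typescript
// Hypothetical guard: each capture gets a token, and unchecking the box
// invalidates any outstanding token so late results are discarded.
class CaptureGuard {
  private current = 0

  // Call when a capture starts; keep the returned token.
  begin(): number {
    this.current += 1
    return this.current
  }

  // Call when the user unchecks the box.
  cancel(): void {
    this.current += 1
  }

  // Call when the capture resolves; store the result only if this is true.
  isCurrent(token: number): boolean {
    return token === this.current
  }
}
```

In a React component an instance would typically be held in a `useRef` so it survives re-renders, with `setScreenshot(...)` called only when `isCurrent(token)` returns true.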

   // Convert messages to Gemini format
-  const prompt = messages.map((m) => `${m.role}: ${m.content}`).join("\n\n")
+  const prompt = messages.map((m) => {
+    const content = typeof m.content === 'string' ? m.content : JSON.stringify(m.content)

For Gemini, JSON.stringify(m.content) will embed the full screenshot data URL into the text prompt, which is likely to blow past payload/token limits and isn’t Gemini’s supported multimodal format. Consider stripping image parts (or mapping them to Gemini-native parts) when the provider isn’t OpenAI-compatible vision.

aj47 added 4 commits December 20, 2025 15:40
In Electron 31+, desktopCapturer is only available in the main process.
This was causing the preload script to fail to load, breaking the entire
UI because window.electron.ipcRenderer was undefined.

Changes:
- Remove desktopCapturer import from preload script
- Add IPC handler 'getScreenSources' in main process
- Update text-input-panel to use new IPC-based API
- Update type definitions for the new API

- Add captureWantedRef to prevent setting screenshot state after user unchecks
- Add screenshotError state to show user feedback when capture fails
- Uncheck the checkbox automatically when capture fails
- Add proper dependencies to useEffect for React best practices

Addresses PR review comments about race conditions and missing error feedback.

- Add extractTextFromContent helper to avoid embedding full image data URLs
- Replace image parts with '[image attached]' placeholder for Gemini
- Prevents payload/token limit issues when screenshots are in messages

Addresses PR review comment about Gemini multimodal content handling.

- Validate screenshot format (must be data:image/ data URL)
- Validate screenshot size (maximum 10MB)
- Throw descriptive errors for invalid screenshots

Addresses PR review comment about screenshot validation.
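
The validation rules listed above could look roughly like the following. This is a sketch under assumptions: the function names are hypothetical, the payload is assumed to be base64, and size is estimated from the decoded byte count (the PR may measure it differently, for example by string length).

```typescript
const MAX_SCREENSHOT_BYTES = 10 * 1024 * 1024 // 10MB cap from the commit message

// Estimates the decoded size of a base64 payload from its length and padding.
function estimateBase64Bytes(base64: string): number {
  const padding = base64.endsWith("==") ? 2 : base64.endsWith("=") ? 1 : 0
  return Math.floor((base64.length * 3) / 4) - padding
}

// Throws a descriptive error if the screenshot is not an acceptable data URL.
function validateScreenshot(screenshot: string): void {
  if (!screenshot.startsWith("data:image/")) {
    throw new Error("Invalid screenshot: expected a data:image/ data URL")
  }
  const commaIndex = screenshot.indexOf(",")
  if (commaIndex === -1) {
    throw new Error("Invalid screenshot: data URL has no payload")
  }
  const bytes = estimateBase64Bytes(screenshot.slice(commaIndex + 1))
  if (bytes > MAX_SCREENSHOT_BYTES) {
    throw new Error(
      `Screenshot too large: ${bytes} bytes exceeds the ${MAX_SCREENSHOT_BYTES}-byte limit`,
    )
  }
}
```

Because the function throws rather than returning a flag, every caller needs a `try`/`catch` or a `.catch()` on the surrounding promise.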

aj47 (Owner, Author) commented Dec 20, 2025

augment review

@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

// Pass existingSessionId to reuse the session if found
// When fromTile=true, start snoozed so the floating panel doesn't appear
- processWithAgentMode(input.text, conversationId, existingSessionId, input.fromTile ?? false)
+ processWithAgentMode(input.text, conversationId, existingSessionId, input.fromTile ?? false, input.screenshot)

This is a fire-and-forget promise chain without a .catch(), so the new screenshot validation errors (and other failures inside processWithAgentMode) can surface as unhandled promise rejections. Consider ensuring failures are caught/logged (and ideally surfaced to the UI) to avoid silent session failures.
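
A small wrapper is one way to keep the call fire-and-forget while guaranteeing a rejection handler. `fireAndForget` and its `onError` callback are hypothetical; in this codebase the callback would presumably log the error and notify the renderer.

```typescript
// Hypothetical wrapper: runs a promise in the background but never lets a
// rejection go unhandled. `onError` stands in for logging and/or UI reporting.
function fireAndForget(task: Promise<unknown>, onError: (error: Error) => void): void {
  task.catch((reason: unknown) => {
    // Normalize non-Error rejections so the handler always gets an Error.
    const error = reason instanceof Error ? reason : new Error(String(reason))
    onError(error)
  })
}
```

The existing `processWithAgentMode(...)` call would be passed as `task`. Note that `onError` itself must not throw, otherwise the rejection resurfaces on the promise returned by `.catch()`.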

*/
async function makeLLMCallAttempt(
- messages: Array<{ role: string; content: string }>,
+ messages: Array<{ role: string; content: string | any[] }>,

makeLLMCallAttempt now allows content to be an array, but the debug log inside this function still does messages[...].content.substring(...) which will throw when isDebugLLM() is enabled and a screenshot is attached. The debug/logging path should tolerate non-string content values.
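
A log-safe preview helper would sidestep the `substring` crash. The sketch below is illustrative only; `previewContent` is a hypothetical name, and it summarizes array content by part type so that base64 image data is never written to the debug log.

```typescript
type MessageContent = string | unknown[]

// Produces a short preview for either content shape without assuming that
// string methods exist on the value.
function previewContent(content: MessageContent, maxLength = 100): string {
  if (typeof content === "string") return content.substring(0, maxLength)
  const summary = content
    .map((part) => {
      if (typeof part === "object" && part !== null && "type" in part) {
        return `[${String((part as { type: unknown }).type)}]`
      }
      return "[unknown]"
    })
    .join(" ")
  return summary.substring(0, maxLength)
}
```

The debug path would then call `previewContent(message.content)` wherever it currently calls `.substring(...)` directly.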

aj47 added 2 commits December 20, 2025 16:41
- Fix macOS Screen Recording permission detection (empty sources array)
- Add helper functions to handle multimodal content (string or array)
- Add detail: high to image_url for xAI/Grok vision support
- Improve error handling in text-input-panel for screenshot errors

… monitor selection

- Add screenshot support for voice/recording input (passes through triggerMcpRecording and createMcpRecording)
- Add 'Always include screenshot' setting in Settings > Shortcuts > Text Input
- Add preview button (eye icon) next to screenshot checkbox to preview before sending
- Add monitor/display selection in Settings > Screenshot section
- Screenshot capture now uses configured display with fallback to primary

aj47 (Owner, Author) commented Dec 23, 2025

augment review

augmentcode bot commented Dec 23, 2025

This pull request is too large for Augment to review. The PR exceeds the maximum size limit of 100000 tokens (approximately 400000 characters) for automated code review. Please consider breaking this PR into smaller, more focused changes.

aj47 (Owner, Author) commented Dec 23, 2025

augment review

augmentcode bot commented Dec 23, 2025

This pull request is too large for Augment to review. The PR exceeds the maximum size limit of 100000 tokens (approximately 400000 characters) for automated code review. Please consider breaking this PR into smaller, more focused changes.

aj47 (Owner, Author) commented Dec 23, 2025

augment review

@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

onSubmit(text.trim(), screenshotToSend)
setText("")
setScreenshot(null)
setIncludeScreenshot(false)

handleSubmit() always does setIncludeScreenshot(false), which means the alwaysIncludeScreenshot setting won’t persist across submissions (since the sync effect only reruns when alwaysIncludeScreenshot changes). Consider resetting includeScreenshot back to alwaysIncludeScreenshot after submit so the config is consistently honored.

})

if (sources.length === 0) {
logApp('[captureScreenshotFromMain] No screen sources available')

On macOS, desktopCapturer.getSources() can return an empty array when Screen Recording permission isn’t granted; captureScreenshotFromMain() currently just logs and returns undefined, so screenshotForVoiceCommands can silently do nothing. Consider surfacing a clearer error path/message (similar to the IPC handler) so users know they need to grant permission and restart.
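
The clearer error path might be as simple as a guard that converts an empty source list into an actionable exception. The function name and message wording below are assumptions; `platform` is passed in (it would typically be `process.platform`) so the check stays pure and testable.

```typescript
// Hypothetical guard: turns an empty source list into an actionable error.
function assertScreenSourcesAvailable(sourceCount: number, platform: string): void {
  if (sourceCount > 0) return
  if (platform === "darwin") {
    // On macOS an empty list usually means the permission has not been granted.
    throw new Error(
      "No screen sources available. Grant Screen Recording permission in the " +
        "macOS privacy settings, then restart the app.",
    )
  }
  throw new Error("No screen sources available for capture.")
}
```

Throwing here, instead of returning `undefined`, pushes callers such as the voice-command path to handle the failure explicitly.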

- Reset includeScreenshot to alwaysIncludeScreenshot config value after submit
  (instead of always resetting to false)
- Improve captureScreenshotFromMain error logging for macOS permission issues
  to help users understand they need to grant Screen Recording permission

aj47 (Owner, Author) commented Dec 23, 2025

augment review

@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

const mapConversationToMessages = (
addSummaryPrompt: boolean = false
- ): Array<{ role: "user" | "assistant"; content: string }> => {
+ ): Array<{ role: "user" | "assistant"; content: string | any[] }> => {

mapConversationToMessages now emits content as string | any[] for screenshot messages, but downstream context-budgeting (shrinkMessagesForLLM / LLMMessage in context-budget.ts) still assumes content is a string and uses string-only operations. This can break agent runs when a screenshot is attached and context reduction is enabled; consider normalizing multimodal content before context reduction (or extending the reducer to handle multimodal arrays).
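
If the reducer is extended rather than the content normalized, the length calculation is the first thing that has to change. The sketch below shows one possible length helper; the flat per-image cost is an assumption made for illustration, and the real budget logic in `context-budget.ts` may weigh images differently.

```typescript
type ContentPart = { type: string; text?: string; image_url?: { url: string } }
type MessageContent = string | ContentPart[]

// Assumed flat character cost charged per image part, so that a multi-megabyte
// data URL does not dominate the text budget. The value is arbitrary.
const IMAGE_PART_COST = 1000

// Returns a budgeting length for either content shape.
function getContentLength(content: MessageContent): number {
  if (typeof content === "string") return content.length
  return content.reduce(
    (total, part) =>
      total + (part.type === "text" ? (part.text ?? "").length : IMAGE_PART_COST),
    0,
  )
}
```

Any code that previously read `message.content.length` would call this helper instead, which keeps string-only messages on the same path as before.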

logApp("IPC main registered")

// Register desktopCapturer handler (available only in main process in Electron 31+)
ipcMain.handle('getScreenSources', async (_event, options: { types: ('screen' | 'window')[], thumbnailSize?: { width: number, height: number } }) => {

getScreenSources forwards renderer-supplied options directly into desktopCapturer.getSources; consider validating/clamping types and thumbnailSize so a renderer bug (or injected script) can’t request unexpected capture modes or extreme sizes. Even if the UI currently uses only screen at 1920×1080, tightening this IPC surface helps prevent accidental DoS/privacy regressions.
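
A sanitizing step in front of `desktopCapturer.getSources` could look like the sketch below. The function and constant names are hypothetical; the 100–4096 bounds and the screen-only restriction mirror what the follow-up commit message describes, and the 1920×1080 default comes from the summary above.

```typescript
interface ScreenSourceOptions {
  types: Array<"screen" | "window">
  thumbnailSize?: { width: number; height: number }
}

type SafeScreenSourceOptions = {
  types: "screen"[]
  thumbnailSize: { width: number; height: number }
}

const MIN_DIMENSION = 100
const MAX_DIMENSION = 4096
const DEFAULT_SIZE = { width: 1920, height: 1080 }

// Clamps one dimension, falling back when the value is not a finite number.
function clampDimension(value: unknown, fallback: number): number {
  if (typeof value !== "number" || !Number.isFinite(value)) return fallback
  return Math.min(MAX_DIMENSION, Math.max(MIN_DIMENSION, Math.round(value)))
}

// Returns options that are safe to forward: screen capture only, bounded size.
function sanitizeScreenSourceOptions(raw: unknown): SafeScreenSourceOptions {
  const size = (raw as Partial<ScreenSourceOptions> | null | undefined)?.thumbnailSize
  return {
    types: ["screen"], // ignore the requested types entirely; never allow 'window'
    thumbnailSize: {
      width: clampDimension(size?.width, DEFAULT_SIZE.width),
      height: clampDimension(size?.height, DEFAULT_SIZE.height),
    },
  }
}
```

Treating the input as `unknown` reflects that IPC arguments come from the renderer and are not guaranteed to match the declared TypeScript type at runtime.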

- Fix captureScreenshotFromMain to throw errors instead of silently returning undefined
  - On macOS, throws clear error about Screen Recording permission
  - Callers now properly handle errors with try-catch

- Fix context-budget.ts to handle multimodal content (string | any[])
  - Add getContentLength() helper for multimodal content length calculation
  - Add getContentAsString() helper for extracting text from multimodal content
  - Update LLMMessage type to support multimodal content
  - Update all content operations to use helper functions

- Add input validation to getScreenSources IPC handler
  - Filter types to only allow 'screen' (not 'window') for privacy
  - Clamp thumbnailSize to reasonable bounds (100-4096 pixels)
  - Set sensible defaults if options not provided

aj47 (Owner, Author) commented Dec 23, 2025

augment review

@augmentcode augmentcode bot left a comment

Review completed. No suggestions at this time.

aj47 (Owner, Author) commented Dec 23, 2025

augment review

@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

<DialogHeader>
<DialogTitle>Screenshot Preview</DialogTitle>
<DialogDescription>
This is what will be sent with your message

The preview dialog says “This is what will be sent with your message”, but handlePreviewScreenshot() captures a fresh image into previewImage without updating the screenshot state used by submit, so the preview can diverge from the actual payload (potential privacy surprise). Consider ensuring the preview reflects the same screenshot that will be submitted.

if (!content) return null

// Handle multimodal content (text + screenshot)
if (entry.screenshot) {

Because screenshots are stored on conversationHistory entries and emitted into LLM messages, follow-up turns will resend all prior screenshots in the conversation. This can grow request payloads quickly (even with a per-image cap) and may hit provider/request-size limits on longer sessions; consider how/when older screenshots should be dropped or summarized.
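
One simple policy is to keep only the most recent screenshots when mapping history to LLM messages. The sketch below assumes a history entry shape with an optional `screenshot` field, as described in the PR summary; `pruneOldScreenshots` and the `keep` parameter are hypothetical.

```typescript
interface HistoryEntry {
  role: "user" | "assistant"
  content: string
  screenshot?: string
}

// Returns a copy of the history in which only the most recent `keep`
// screenshots survive; older entries keep their text but lose the image.
function pruneOldScreenshots(history: HistoryEntry[], keep = 1): HistoryEntry[] {
  let remaining = keep
  const result: HistoryEntry[] = []
  // Walk newest to oldest so the allowance is spent on recent screenshots first.
  for (let i = history.length - 1; i >= 0; i--) {
    const entry = history[i]
    if (entry.screenshot && remaining > 0) {
      remaining--
      result.unshift(entry)
    } else {
      result.unshift({ role: entry.role, content: entry.content })
    }
  }
  return result
}
```

The stored history is left untouched, so the UI can still show older screenshots even though they are no longer resent to the provider.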

aj47 added 2 commits December 22, 2025 19:51
Address PR review comment about preview divergence. The preview dialog now
uses the existing screenshot state instead of capturing a fresh one, ensuring
what users see matches what will actually be sent with their message.

- Use existing screenshot state for preview if available
- Only capture fresh screenshot when none exists
- Update screenshot state after capture to keep preview in sync

Address PR review comment about consistent error logging. The catch handler
for the fire-and-forget processWithAgentMode call now uses logApp for
consistency with other error logging in the tipc module.

aj47 (Owner, Author) commented Dec 23, 2025

augment review

@augmentcode augmentcode bot left a comment

Review completed. 1 suggestion posted.


/**
* Capture a screenshot from the configured display (or primary display)
* Returns the screenshot as a data URL, or undefined if capture fails

This doc comment says capture failures return undefined, but the implementation logs and then rethrows, so callers need to handle exceptions rather than checking for undefined. Consider aligning the comment with the actual behavior to avoid misuse.

Development

Successfully merging this pull request may close these issues.

Add screenshot as context option for SpeakMCP input
