Add screenshot capture for multimodal input by aj47 · Pull Request #220 · aj47/SpeakMCP

aj47 · 2025-10-31T18:44:26Z

Summary

This PR implements screenshot capture functionality for multimodal input, allowing users to include screenshots with their text messages when using vision-capable models.

Changes

UI Components

Added screenshot checkbox to text input panel with Camera icon
Added visual feedback for screenshot capture state (capturing, captured)
Screenshot is automatically captured when checkbox is toggled on
Screenshot is cleared after message submission

Backend Implementation

Implemented screenshot capture using Electron's desktopCapturer API
Exposed desktopCapturer API in preload script for renderer access
Added screenshot parameter throughout the processing pipeline:
- createTextInput and createMcpTextInput TIPC procedures
- processWithAgentMode function
- processTranscriptWithAgentMode function

Multimodal Message Format

Updated message format to support both string and array content

Implemented OpenAI-compatible vision API format:

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's in this image?" },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,..." } }
  ]
}

Updated LLM API calls to handle both string and array content formats
Added screenshot field to conversation history entries

Type Safety

Updated TypeScript types to support multimodal content
Updated message type signatures in llm-fetch.ts and llm.ts
All type checks pass successfully

Testing

✅ TypeScript compilation passes
✅ All existing tests pass
Manual testing recommended with vision-capable models (GPT-4V, Claude with vision, etc.)

Fixes

Closes #217

Notes

Screenshot capture uses the primary display by default
Screenshots are captured at 1920x1080 resolution
Base64-encoded screenshots are included directly in the API request
This feature works with any OpenAI-compatible vision model
Non-vision models will ignore the image data gracefully

Summary by CodeRabbit

New Features
- Optional screenshot capture in the text input UI so users can submit screen images with text
- Multimodal conversation support: LLM interactions now accept and render combined text and image content
- Message content can include structured (non-string) data to support richer outputs
Chores
- Renderer-preload now exposes a desktop capture API for obtaining screen sources from the environment

- Add screenshot checkbox to text input panel with Camera icon - Implement screenshot capture using Electron's desktopCapturer API - Update message format to support multimodal content (text + image) - Add screenshot parameter throughout the processing pipeline - Update LLM API calls to handle both string and array content formats - Support OpenAI-compatible vision API format with base64 images - Expose desktopCapturer API in preload script for renderer access Fixes #217

coderabbitai · 2025-10-31T18:44:37Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The changes add screenshot capture and multimodal support end-to-end: renderer UI captures screenshots, preload exposes desktopCapturer, IPC threads screenshot through tipc, LLM layers accept structured (text+image) message content and update token estimation and prompt construction for non-string content.

Changes

Cohort / File(s)	Summary
Renderer UI & Input Pages `src/renderer/src/components/text-input-panel.tsx`, `src/renderer/src/pages/panel.tsx`	Add UI toggle and capture state for screenshots, auto-capture behavior, Camera icon and status; update `onSubmit` and internal submit handlers to accept and forward optional `screenshot` alongside `text`; propagate screenshot through mutation functions and submit flow.
IPC / Agent Router `src/main/tipc.ts`	Add optional `screenshot?: string` parameter to `processWithAgentMode`, update `.input()` signatures for `createTextInput` and `createMcpTextInput`, and thread screenshot through fallback and MCP-enabled paths to ensure downstream processing receives it.
LLM Multimodal Processing `src/main/llm.ts`	Add optional `screenshot` parameter to `processTranscriptWithAgentMode`, propagate screenshot in conversation entries, and transform entries with screenshot into multimodal content parts (text + image_url) where prompts/progress updates are built.
LLM Fetch / Structured Content Support `src/main/llm-fetch.ts`	Broaden `message.content` type to `string
Preload / Desktop Capture Exposure `src/preload/index.d.ts`, `src/preload/index.ts`	Extend `window.electron` typing to include `desktopCapturer.getSources(...)` and expose `api.desktopCapturer.getSources` that delegates to `electron.desktopCapturer.getSources`.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant UI as TextInputPanel
    participant Panel as panel.tsx
    participant Preload as preload
    participant Main as main/tipc
    participant Agent as Agent Processing
    participant LLM as LLM API

    User->>UI: Enable "Include screenshot" + enter text
    UI->>Preload: api.desktopCapturer.getSources()
    Preload-->>UI: Screenshot DataURL
    User->>UI: Submit (text + screenshot)
    UI->>Panel: onSubmit(text, screenshot)
    Panel->>Main: createTextInput/createMcpTextInput({text, screenshot})
    Main->>Agent: processWithAgentMode(text, conversationId, screenshot)
    Agent->>Agent: Transform history -> [{type:'text', text}, {type:'image_url', image_url}]
    Agent->>LLM: Call LLM with multimodal message.content (array)
    LLM-->>Agent: Response
    Agent-->>Panel: Result
    Panel-->>UI: Clear input & screenshot

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Extra attention areas:
- src/renderer/src/components/text-input-panel.tsx — verify useEffect and capture lifecycle, UI/UX edge cases.
- src/main/llm.ts — ensure all prompt/progress mapping paths correctly handle array-formatted content.
- src/main/llm-fetch.ts — confirm token estimation and Gemini prompt stringification match intended API payload semantics.
- src/main/tipc.ts — confirm screenshot is threaded and optional/undefined handling is consistent.

Poem

🐰 I snapped a scene with nimble paws,
Sent pixels through the app’s soft laws,
Text and image hop in one array,
LLM listens, brightens up the day,
Rabbit cheers — a multimodal cause! 📸✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Linked Issues Check	⚠️ Warning	The PR partially implements the requirements from issue #217 but leaves a significant gap. UI and multimodal integration requirements are met: screenshot checkbox with visual feedback added to TextInputPanel, system capture integrated via Electron's desktopCapturer, and LLM integration updated to handle multimodal content in OpenAI-compatible format. However, the PR does not implement agent settings configuration to enable/disable screenshot behavior or configure quality/format preferences, which are explicitly listed as acceptance criteria and primary objectives in the linked issue. The raw_summary shows no changes to any agent settings or configuration UI components.	Consider either including agent settings configuration in this PR (checkbox to enable/disable screenshot per agent, quality/format options) or clarifying whether agent settings are intentionally deferred to a follow-up PR. If deferred, the linked issue's acceptance criteria should be updated to reflect that this PR only addresses UI screenshot capture and multimodal transmission, with agent-level configuration planned separately.
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The PR title "Add screenshot capture for multimodal input" clearly and concisely summarizes the primary changes in the changeset. The implementation adds screenshot capture UI components, integrates Electron's desktopCapturer API, propagates screenshot data through the processing pipeline, and updates LLM integration to support multimodal content. The title directly reflects these core additions without being vague or misleading.
Out of Scope Changes Check	✅ Passed	All code changes in the PR are directly aligned with screenshot capture and multimodal input objectives. The modifications to llm-fetch.ts and llm.ts support multimodal content types, changes to tipc.ts propagate screenshot parameters through the processing pipeline, preload scripts expose the desktopCapturer API, and renderer components implement the UI and capture logic. No extraneous, unrelated, or tangential changes are present outside the scope of enabling screenshot capture for vision-capable models.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/screenshot-multimodal-input

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2025-10-31T18:47:31Z

apps/desktop/src/renderer/src/components/text-input-panel.tsx

+  const captureScreenshot = async () => {
+    setIsCapturingScreenshot(true)
+    try {
+      // Use Electron's desktopCapturer API to capture screenshot
+      const sources = await (window as any).electron.desktopCapturer.getSources({
+        types: ['screen'],


Wire screenshot API to the wrong global

The screenshot checkbox calls window.electron.desktopCapturer.getSources(...), but the preload script only exposes desktopCapturer on the custom electronAPI object (contextBridge.exposeInMainWorld("electronAPI", api)). window.electron still points to the default Electron toolkit API and has no desktopCapturer, so the call will throw and screenshots are never captured. Expose the capturer on window.electron or update the renderer to use window.electronAPI.desktopCapturer.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2025-10-31T18:47:31Z

src/renderer/src/components/text-input-panel.tsx

  const handleSubmit = () => {
    if (text.trim() && !isProcessing) {
-      onSubmit(text.trim())
+      onSubmit(text.trim(), screenshot || undefined)
      setText("")


Send screenshots even after user unchecks option

handleSubmit always forwards screenshot || undefined to onSubmit and then clears the state afterwards. The checkbox only toggles includeScreenshot; it never clears screenshot when the user turns the option off. If a screenshot was captured and the user unchecks the box before sending, the image is still submitted despite the UI indicating otherwise, which is a privacy risk. Either gate the submission on includeScreenshot or reset screenshot when the checkbox is unchecked.

Useful? React with 👍 / 👎.

aj47 · 2025-10-31T18:48:27Z

augment review

augmentcode

Review completed. 4 suggestions posted.

augmentcode · 2025-10-31T18:48:30Z

apps/desktop/src/preload/index.ts

@@ -10,6 +10,10 @@ const api = {
  getOAuthStatus: (serverName: string) => ipcRenderer.invoke('getOAuthStatus', serverName),
  revokeOAuthTokens: (serverName: string) => ipcRenderer.invoke('revokeOAuthTokens', serverName),
  testMCPServer: (serverName: string, config: any) => ipcRenderer.invoke('testMCPServer', { serverName, config }),
+  // Screenshot API
+  desktopCapturer: {


The desktopCapturer API is exposed under window.electronAPI, but the renderer uses window.electron.desktopCapturer; this mismatch will make screenshot capture undefined at runtime. Consider aligning the exposure with the usage (related to the usage in text-input-panel.tsx).

🤖 React with 👍 or 👎 to let us know if the comment was useful, or 🚀 if it prevented an incident/outage.

augmentcode · 2025-10-31T18:48:30Z

apps/desktop/src/renderer/src/components/text-input-panel.tsx

+    setIsCapturingScreenshot(true)
+    try {
+      // Use Electron's desktopCapturer API to capture screenshot
+      const sources = await (window as any).electron.desktopCapturer.getSources({


This references window.electron.desktopCapturer, but preload exposes desktopCapturer on window.electronAPI, so this will be undefined at runtime. Related to the exposure in src/preload/index.ts.

🤖 React with 👍 or 👎 to let us know if the comment was useful, or 🚀 if it prevented an incident/outage.

augmentcode · 2025-10-31T18:48:30Z

apps/desktop/src/main/llm-fetch.ts

@@ -570,7 +570,10 @@ async function makeOpenAICompatibleCall(
  }

  const model = getModel(providerId, "mcp")
-  const estimatedTokens = Math.ceil(messages.reduce((sum, msg) => sum + msg.content.length, 0) / 4)
+  const estimatedTokens = Math.ceil(messages.reduce((sum, msg) => {


With multimodal messages (array content), the request logging in makeAPICallAttempt that truncates message content via substring assumes string content and will throw in debug mode. Consider making the debug logging robust to array content to avoid runtime errors when isDebugLLM() is true.

🤖 React with 👍 or 👎 to let us know if the comment was useful, or 🚀 if it prevented an incident/outage.

augmentcode · 2025-10-31T18:48:30Z

apps/desktop/src/renderer/src/components/text-input-panel.tsx

+      })
+
+      if (sources && sources.length > 0) {
+        // Get the first screen (primary display)


Selecting sources[0] as the "primary display" isn’t guaranteed by desktopCapturer; source ordering can vary across platforms. This can capture the wrong screen on multi‑monitor setups (also applies to other locations if replicated).

🤖 React with 👍 or 👎 to let us know if the comment was useful, or 🚀 if it prevented an incident/outage.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/main/llm-fetch.ts (1)
827-835: Handle multimodal responses without .trim() runtime errors

Upstream changes now allow messageObj.content to be an array (e.g., { type: "text" }, { type: "image_url" }). This line still assumes a string and calls .trim(), so the first multimodal assistant reply will crash with TypeError: rawContent.trim is not a function, aborting the whole agent run. Normalize array content before trimming.
-  let content: string | undefined = (messageObj.content ?? "").trim()
+  const rawContent = messageObj.content
+  let content: string | undefined
+
+  if (Array.isArray(rawContent)) {
+    content = rawContent
+      .map((part: any) => {
+        if (typeof part?.text === "string") return part.text
+        if (typeof part?.image_url?.url === "string") return `[image:${part.image_url.url}]`
+        return ""
+      })
+      .filter(Boolean)
+      .join("\n")
+      .trim()
+  } else {
+    content = (rawContent ?? "").trim()
+  }

🧹 Nitpick comments (2)

src/renderer/src/pages/panel.tsx (2)
135-156: Consider adding client-side validation for screenshot data.

The mutation accepts and forwards the screenshot to the backend without validation. Adding client-side checks would improve UX by catching issues earlier and providing clearer error messages.

Consider adding validation before the backend call:
 const textInputMutation = useMutation({
   mutationFn: async ({ text, screenshot }: { text: string; screenshot?: string }) => {
+    // Validate screenshot if provided
+    if (screenshot) {
+      if (!screenshot.startsWith('data:image/')) {
+        throw new Error('Invalid screenshot format. Please try capturing again.')
+      }
+      const sizeInMB = (screenshot.length * 0.75) / (1024 * 1024)
+      if (sizeInMB > 10) {
+        throw new Error(`Screenshot is too large (${sizeInMB.toFixed(1)}MB). Maximum size is 10MB.`)
+      }
+    }
     await tipcClient.createTextInput({ text, screenshot })
   },
158-193: Apply the same validation to MCP text input mutation.

Similar to textInputMutation, this mutation should validate the screenshot parameter before sending to the backend to catch errors early and provide better UX.

Apply the same validation pattern suggested for textInputMutation at lines 135-156.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 881512f and 9ad19c9.

📒 Files selected for processing (7)

src/main/llm-fetch.ts (6 hunks)
src/main/llm.ts (5 hunks)
src/main/tipc.ts (5 hunks)
src/preload/index.d.ts (1 hunks)
src/preload/index.ts (2 hunks)
src/renderer/src/components/text-input-panel.tsx (4 hunks)
src/renderer/src/pages/panel.tsx (4 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

src/main/tipc.ts (1)

src/main/config.ts (1)

configStore (142-142)

src/renderer/src/pages/panel.tsx (1)

src/renderer/src/lib/tipc-client.ts (1)

tipcClient (12-14)

🔇 Additional comments (3)

src/main/tipc.ts (2)

619-688: Screenshot parameter properly threaded through agent mode flow.

The implementation correctly forwards the screenshot parameter through the MCP text input flow to processWithAgentMode. The screenshot will be included in the conversation history and multimodal LLM prompts as described in the PR objectives.

Note: Ensure screenshot validation is added as suggested in the earlier comment on processWithAgentMode.

566-617: Screenshot parameter accepted but unused in createTextInput function.

The createTextInput function accepts a screenshot parameter (line 569) but never uses it anywhere in the function body. The screenshot is passed to this function when mcpToolsEnabled is false (line 630), but it's silently discarded.

Screenshots are only processed in the MCP flow via processWithAgentMode, where they're sent to LLMs as multimodal content. The fallback text-input path doesn't persist or process screenshots despite accepting them in the input type definition.

Recommendation: Either remove screenshot from the createTextInput input type since it's not used, or document this limitation. Alternatively, if screenshots should be stored for non-MCP mode, add a screenshot field to RecordingHistoryItem and persist them.

src/renderer/src/pages/panel.tsx (1)

352-375: Screenshot parameter properly propagated through submission flow.

The implementation correctly forwards the screenshot parameter to both MCP and standard text input mutations, ensuring it reaches the backend processing layer. The conversation message creation appropriately includes only the text, keeping the screenshot as context for the LLM rather than display content.

coderabbitai · 2025-10-31T18:55:17Z

src/main/tipc.ts

 async function processWithAgentMode(
  text: string,
  conversationId?: string,
+  screenshot?: string,
 ): Promise<string> {


⚠️ Potential issue | 🟠 Major

Add validation for screenshot parameter.

The screenshot parameter accepts base64-encoded image data that can be 2-3MB for the configured 1920x1080 resolution. Without validation, this could lead to memory issues or API payload size limit violations downstream.

Consider adding validation before processing:

async function processWithAgentMode( text: string, conversationId?: string, screenshot?: string, ): Promise<string> { + // Validate screenshot if provided + if (screenshot) { + // Check format (should be data URL) + if (!screenshot.startsWith('data:image/')) { + throw new Error('Invalid screenshot format') + } + // Check size (e.g., 10MB limit) + const sizeInMB = (screenshot.length * 0.75) / (1024 * 1024) // approximate base64 size + if (sizeInMB > 10) { + throw new Error(`Screenshot too large: ${sizeInMB.toFixed(1)}MB (max 10MB)`) + } + } + const config = configStore.get()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

async function processWithAgentMode(

text: string,

conversationId?: string,

screenshot?: string,

): Promise<string> {

async function processWithAgentMode(

text: string,

conversationId?: string,

screenshot?: string,

): Promise<string> {

// Validate screenshot if provided

if (screenshot) {

// Check format (should be data URL)

if (!screenshot.startsWith('data:image/')) {

throw new Error('Invalid screenshot format')

}

// Check size (e.g., 10MB limit)

const sizeInMB = (screenshot.length * 0.75) / (1024 * 1024) // approximate base64 size

if (sizeInMB > 10) {

throw new Error(`Screenshot too large: ${sizeInMB.toFixed(1)}MB (max 10MB)`)

}

}

const config = configStore.get()

coderabbitai · 2025-10-31T18:55:17Z

apps/desktop/src/preload/index.ts

+  // Screenshot API
+  desktopCapturer: {
+    getSources: (options: any) => desktopCapturer.getSources(options)
+  }


⚠️ Potential issue | 🔴 Critical

Expose desktopCapturer on the object the renderer actually uses

contextBridge.exposeInMainWorld("electron", electronAPI) still publishes the unmodified toolkit API, so the newly added screenshot helper only exists on window.electronAPI.desktopCapturer. The renderer code calls window.electron.desktopCapturer.getSources(...), which will be undefined at runtime—screenshot capture never fires and throws. Please merge the desktopCapturer proxy into the object you expose as window.electron so the preload and renderer stay in sync.

const api = { @@ - desktopCapturer: { - getSources: (options: any) => desktopCapturer.getSources(options) - } + desktopCapturer: { + getSources: (options: any) => desktopCapturer.getSources(options) + } } +const electronWithDesktopCapturer = { + ...electronAPI, + desktopCapturer: api.desktopCapturer, +} + if (process.contextIsolated) { try { - contextBridge.exposeInMainWorld("electron", electronAPI) + contextBridge.exposeInMainWorld("electron", electronWithDesktopCapturer) contextBridge.exposeInMainWorld("electronAPI", api) @@ - window.electron = electronAPI + window.electron = electronWithDesktopCapturer window.electronAPI = api }

Committable suggestion skipped: line range outside the PR's diff.

src/renderer/src/components/text-input-panel.tsx

- Only send screenshot when includeScreenshot is true - Clear screenshot data when user unchecks the box - Prevents unintended screenshot transmission - Respects user privacy expectations

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

src/renderer/src/components/text-input-panel.tsx (3)
68-68: Still using wrong API path - critical runtime error.

Past reviews flagged that window.electron.desktopCapturer is undefined; the preload script exposes it on window.electronAPI.desktopCapturer. This will throw at runtime and prevent screenshot capture.

Apply this diff:
-      const sources = await (window as any).electron.desktopCapturer.getSources({
+      const sources = await (window as any).electronAPI.desktopCapturer.getSources({
74-75: Primary display assumption unreliable on multi-monitor setups.

desktopCapturer doesn't guarantee source ordering, so sources[0] may capture a secondary screen. Consider filtering for the primary display or documenting this limitation.

96-103: Missing dependency in useEffect.

The effect references screenshot in line 97 and 99 but doesn't include it in the dependency array. React's exhaustive-deps rule flags this. Add screenshot to the dependencies or use a callback to avoid potential stale closure issues.
-  }, [includeScreenshot])
+  }, [includeScreenshot, screenshot])
Note: You may also need to add isCapturingScreenshot as a guard to prevent re-capture loops.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad19c9 and dfe0269.

📒 Files selected for processing (1)

src/renderer/src/components/text-input-panel.tsx (4 hunks)

🔇 Additional comments (3)

src/renderer/src/components/text-input-panel.tsx (3)

85-93: Privacy concern properly addressed.

The submission now correctly respects the includeScreenshot toggle and only sends the screenshot when both the checkbox is checked AND a screenshot exists. This resolves the privacy issue flagged in previous reviews.

96-103: Screenshot clearing properly implemented.

The effect correctly clears the screenshot when the user unchecks the box, addressing the privacy concern from previous reviews.

193-212: UI implementation looks good.

The screenshot checkbox, status indicators, and disabled states are properly implemented with clear visual feedback.

coderabbitai · 2025-10-31T19:10:09Z

apps/desktop/src/renderer/src/components/text-input-panel.tsx

+    } catch (error) {
+      console.error('Failed to capture screenshot:', error)
+    } finally {
+      setIsCapturingScreenshot(false)
+    }


⚠️ Potential issue | 🟡 Minor

Provide user feedback when screenshot capture fails.

Errors are logged but users see no indication that capture failed. The checkbox stays checked and shows "Capturing..." indefinitely if an error occurs, creating a confusing state.

Consider showing a brief error message or toast notification when capture fails.

🤖 Prompt for AI Agents

In src/renderer/src/components/text-input-panel.tsx around lines 78 to 82, the catch block only logs the error so the UI stays in a "Capturing..." state and the checkbox remains checked; update the catch to set the capturing state back to false (or ensure finally still runs) and surface a brief user-facing error (e.g., trigger an existing toast/notification or set a local error message state that displays a short inline message) so users know the capture failed; keep the console.error for debugging but call the toast/notification helper (or setError) with a friendly message like "Screenshot failed" so the checkbox and label are restored and the user receives feedback.

aj47 · 2025-11-01T17:24:13Z

augment review

augmentcode

Review completed. No suggestions at this time.

Resolves merge conflicts from monorepo restructure: - Fixed content conflict in llm-fetch.ts (combined multimodal content type with sessionId/onRetryProgress) - Applied screenshot/multimodal changes to new file locations: - apps/desktop/src/main/llm.ts: Added screenshot parameter and multimodal message handling - apps/desktop/src/main/tipc.ts: Added screenshot parameter to processWithAgentMode and router - apps/desktop/src/renderer/src/pages/panel.tsx: Added screenshot to mutations and handleTextSubmit - Removed old src/ files that were deleted in main but modified in PR Fixes #671

aj47 · 2025-12-20T23:22:45Z

augment review

augmentcode · 2025-12-20T23:27:23Z

🤖 Augment PR Summary

Summary: Adds optional screenshot capture so users can send multimodal (text + image) prompts to vision-capable models.

Changes:

Renderer: adds an “Include screenshot” toggle to the text input panel, capture state feedback, and a preview dialog.
Settings: introduces screenshot preferences (always include for text input, auto-capture for voice commands, and display selection for multi-monitor setups).
Main/preload: exposes screen capture via IPC (main-process desktopCapturer handler + preload bridge) and adds a helper to capture screenshots from main.
TIPC/agent pipeline: threads an optional screenshot data URL through text/MCP entrypoints into agent-mode processing.
LLM layer: supports OpenAI-style multimodal content arrays and updates context/logging utilities to tolerate non-string message content.

Technical Notes: Screenshots are sent as data:image/… URLs (default 1920×1080) with basic size/format validation, and macOS permission failures are surfaced with actionable errors.

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2025-12-20T23:27:24Z

apps/desktop/src/renderer/src/components/text-input-panel.tsx

+      // Clear screenshot when user unchecks the box
+      setScreenshot(null)
+    }
+  }, [includeScreenshot])


Because this effect only depends on includeScreenshot, an in-flight captureScreenshot() can still call setScreenshot(...) after the user unchecks the box, leaving a “captured” image in state while the toggle is off. Consider making the capture lifecycle resilient to that race so the UI state can’t re-populate a screenshot after opt-out.

_{🤖 Was this useful? React with 👍 or 👎}

augmentcode · 2025-12-20T23:27:24Z

apps/desktop/src/main/llm-fetch.ts

  // Convert messages to Gemini format
-  const prompt = messages.map((m) => `${m.role}: ${m.content}`).join("\n\n")
+  const prompt = messages.map((m) => {
+    const content = typeof m.content === 'string' ? m.content : JSON.stringify(m.content)


For Gemini, JSON.stringify(m.content) will embed the full screenshot data URL into the text prompt, which is likely to blow past payload/token limits and isn’t Gemini’s supported multimodal format. Consider stripping image parts (or mapping them to Gemini-native parts) when the provider isn’t OpenAI-compatible vision.

_{🤖 Was this useful? React with 👍 or 👎}

In Electron 31+, desktopCapturer is only available in the main process. This was causing the preload script to fail to load, breaking the entire UI because window.electron.ipcRenderer was undefined. Changes: - Remove desktopCapturer import from preload script - Add IPC handler 'getScreenSources' in main process - Update text-input-panel to use new IPC-based API - Update type definitions for the new API

- Add captureWantedRef to prevent setting screenshot state after user unchecks - Add screenshotError state to show user feedback when capture fails - Uncheck the checkbox automatically when capture fails - Add proper dependencies to useEffect for React best practices Addresses PR review comments about race conditions and missing error feedback.

- Add extractTextFromContent helper to avoid embedding full image data URLs - Replace image parts with '[image attached]' placeholder for Gemini - Prevents payload/token limit issues when screenshots are in messages Addresses PR review comment about Gemini multimodal content handling.

- Validate screenshot format (must be data:image/ data URL) - Validate screenshot size (maximum 10MB) - Throw descriptive errors for invalid screenshots Addresses PR review comment about screenshot validation.

aj47 · 2025-12-20T23:45:08Z

augment review

augmentcode

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2025-12-20T23:50:23Z

apps/desktop/src/main/tipc.ts

      // Pass existingSessionId to reuse the session if found
      // When fromTile=true, start snoozed so the floating panel doesn't appear
-      processWithAgentMode(input.text, conversationId, existingSessionId, input.fromTile ?? false)
+      processWithAgentMode(input.text, conversationId, existingSessionId, input.fromTile ?? false, input.screenshot)


This is a fire-and-forget promise chain without a .catch(), so the new screenshot validation errors (and other failures inside processWithAgentMode) can surface as unhandled promise rejections. Consider ensuring failures are caught/logged (and ideally surfaced to the UI) to avoid silent session failures.

_{🤖 Was this useful? React with 👍 or 👎}

augmentcode · 2025-12-20T23:50:23Z

apps/desktop/src/main/llm-fetch.ts

 */
 async function makeLLMCallAttempt(
-  messages: Array<{ role: string; content: string }>,
+  messages: Array<{ role: string; content: string | any[] }>,


makeLLMCallAttempt now allows content to be an array, but the debug log inside this function still does messages[...].content.substring(...) which will throw when isDebugLLM() is enabled and a screenshot is attached. The debug/logging path should tolerate non-string content values.

_{🤖 Was this useful? React with 👍 or 👎}

- Fix macOS Screen Recording permission detection (empty sources array) - Add helper functions to handle multimodal content (string or array) - Add detail: high to image_url for xAI/Grok vision support - Improve error handling in text-input-panel for screenshot errors

… monitor selection - Add screenshot support for voice/recording input (passes through triggerMcpRecording and createMcpRecording) - Add 'Always include screenshot' setting in Settings > Shortcuts > Text Input - Add preview button (eye icon) next to screenshot checkbox to preview before sending - Add monitor/display selection in Settings > Screenshot section - Screenshot capture now uses configured display with fallback to primary

aj47 · 2025-12-23T02:09:14Z

augment review

augmentcode · 2025-12-23T02:09:20Z

This pull request is too large for Augment to review. The PR exceeds the maximum size limit of 100000 tokens (approximately 400000 characters) for automated code review. Please consider breaking this PR into smaller, more focused changes.

aj47 · 2025-12-23T02:25:22Z

augment review

augmentcode · 2025-12-23T02:25:28Z

This pull request is too large for Augment to review. The PR exceeds the maximum size limit of 100000 tokens (approximately 400000 characters) for automated code review. Please consider breaking this PR into smaller, more focused changes.

aj47 · 2025-12-23T03:10:56Z

augment review

augmentcode

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2025-12-23T03:15:03Z

apps/desktop/src/renderer/src/components/text-input-panel.tsx

+      onSubmit(text.trim(), screenshotToSend)
      setText("")
+      setScreenshot(null)
+      setIncludeScreenshot(false)


handleSubmit() always does setIncludeScreenshot(false), which means the alwaysIncludeScreenshot setting won’t persist across submissions (since the sync effect only reruns when alwaysIncludeScreenshot changes). Consider resetting includeScreenshot back to alwaysIncludeScreenshot after submit so the config is consistently honored.

_{🤖 Was this useful? React with 👍 or 👎}

augmentcode · 2025-12-23T03:15:03Z

apps/desktop/src/main/window.ts

+    })
+
+    if (sources.length === 0) {
+      logApp('[captureScreenshotFromMain] No screen sources available')


On macOS, desktopCapturer.getSources() can return an empty array when Screen Recording permission isn’t granted; captureScreenshotFromMain() currently just logs and returns undefined, so screenshotForVoiceCommands can silently do nothing. Consider surfacing a clearer error path/message (similar to the IPC handler) so users know they need to grant permission and restart.

_{🤖 Was this useful? React with 👍 or 👎}

- Reset includeScreenshot to alwaysIncludeScreenshot config value after submit (instead of always resetting to false) - Improve captureScreenshotFromMain error logging for macOS permission issues to help users understand they need to grant Screen Recording permission

aj47 · 2025-12-23T03:23:13Z

augment review

augmentcode

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2025-12-23T03:34:01Z

apps/desktop/src/main/llm.ts

  const mapConversationToMessages = (
    addSummaryPrompt: boolean = false
-  ): Array<{ role: "user" | "assistant"; content: string }> => {
+  ): Array<{ role: "user" | "assistant"; content: string | any[] }> => {


mapConversationToMessages now emits content as string | any[] for screenshot messages, but downstream context-budgeting (shrinkMessagesForLLM / LLMMessage in context-budget.ts) still assumes content is a string and uses string-only operations. This can break agent runs when a screenshot is attached and context reduction is enabled; consider normalizing multimodal content before context reduction (or extending the reducer to handle multimodal arrays).

_{🤖 Was this useful? React with 👍 or 👎}

augmentcode · 2025-12-23T03:34:01Z

apps/desktop/src/main/index.ts

  logApp("IPC main registered")

+  // Register desktopCapturer handler (available only in main process in Electron 31+)
+  ipcMain.handle('getScreenSources', async (_event, options: { types: ('screen' | 'window')[], thumbnailSize?: { width: number, height: number } }) => {


getScreenSources forwards renderer-supplied options directly into desktopCapturer.getSources; consider validating/clamping types and thumbnailSize so a renderer bug (or injected script) can’t request unexpected capture modes or extreme sizes. Even if the UI currently uses only screen at 1920×1080, tightening this IPC surface helps prevent accidental DoS/privacy regressions.

_{🤖 Was this useful? React with 👍 or 👎}

- Fix captureScreenshotFromMain to throw errors instead of silently returning undefined - On macOS, throws clear error about Screen Recording permission - Callers now properly handle errors with try-catch - Fix context-budget.ts to handle multimodal content (string | any[]) - Add getContentLength() helper for multimodal content length calculation - Add getContentAsString() helper for extracting text from multimodal content - Update LLMMessage type to support multimodal content - Update all content operations to use helper functions - Add input validation to getScreenSources IPC handler - Filter types to only allow 'screen' (not 'window') for privacy - Clamp thumbnailSize to reasonable bounds (100-4096 pixels) - Set sensible defaults if options not provided

aj47 · 2025-12-23T03:37:24Z

augment review

augmentcode

Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.

aj47 · 2025-12-23T03:42:18Z

augment review

augmentcode

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2025-12-23T03:48:04Z

apps/desktop/src/renderer/src/components/text-input-panel.tsx

+          <DialogHeader>
+            <DialogTitle>Screenshot Preview</DialogTitle>
+            <DialogDescription>
+              This is what will be sent with your message


The preview dialog says “This is what will be sent with your message”, but handlePreviewScreenshot() captures a fresh image into previewImage without updating the screenshot state used by submit, so the preview can diverge from the actual payload (potential privacy surprise). Consider ensuring the preview reflects the same screenshot that will be submitted.

_{🤖 Was this useful? React with 👍 or 👎}

augmentcode · 2025-12-23T03:48:04Z

apps/desktop/src/main/llm.ts

        if (!content) return null
+
+        // Handle multimodal content (text + screenshot)
+        if (entry.screenshot) {


Because screenshots are stored on conversationHistory entries and emitted into LLM messages, follow-up turns will resend all prior screenshots in the conversation. This can grow request payloads quickly (even with a per-image cap) and may hit provider/request-size limits on longer sessions; consider how/when older screenshots should be dropped or summarized.

_{🤖 Was this useful? React with 👍 or 👎}

Address PR review comment about preview divergence. The preview dialog now uses the existing screenshot state instead of capturing a fresh one, ensuring what users see matches what will actually be sent with their message. - Use existing screenshot state for preview if available - Only capture fresh screenshot when none exists - Update screenshot state after capture to keep preview in sync

Address PR review comment about consistent error logging. The catch handler for the fire-and-forget processWithAgentMode call now uses logApp for consistency with other error logging in the tipc module.

aj47 · 2025-12-23T03:52:17Z

augment review

augmentcode

Review completed. 1 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2025-12-23T03:58:25Z

apps/desktop/src/main/window.ts


+/**
+ * Capture a screenshot from the configured display (or primary display)
+ * Returns the screenshot as a data URL, or undefined if capture fails


This doc comment says capture failures return undefined, but the implementation logs and then rethrows, so callers need to handle exceptions rather than checking for undefined. Consider aligning the comment with the actual behavior to avoid misuse.

_{🤖 Was this useful? React with 👍 or 👎}

chatgpt-codex-connector bot reviewed Oct 31, 2025

View reviewed changes

augmentcode bot reviewed Oct 31, 2025

View reviewed changes

coderabbitai bot reviewed Oct 31, 2025

View reviewed changes

Fix privacy issue: clear screenshot when checkbox is unchecked

dfe0269

- Only send screenshot when includeScreenshot is true - Clear screenshot data when user unchecks the box - Prevents unintended screenshot transmission - Respects user privacy expectations

coderabbitai bot reviewed Oct 31, 2025

View reviewed changes

augmentcode bot reviewed Nov 1, 2025

View reviewed changes

aj47 mentioned this pull request Dec 20, 2025

Fix merge conflict between PR #220 and main branch #671

Closed

aj47 added the slot-1 label Dec 20, 2025

augmentcode bot reviewed Dec 20, 2025

View reviewed changes

aj47 added 4 commits December 20, 2025 15:40

fix: add validation for screenshot parameter in processWithAgentMode

3445d12

- Validate screenshot format (must be data:image/ data URL) - Validate screenshot size (maximum 10MB) - Throw descriptive errors for invalid screenshots Addresses PR review comment about screenshot validation.

augmentcode bot reviewed Dec 20, 2025

View reviewed changes

aj47 added 2 commits December 20, 2025 16:41

Remove screenshot for dictation feature

9268673

Remove pnpm-lock.yaml from PR

43df5ea

augmentcode bot reviewed Dec 23, 2025

View reviewed changes

aj47 added 2 commits December 22, 2025 19:51

fix: use logApp instead of logLLM for agent processing errors

98cbe78

Address PR review comment about consistent error logging. The catch handler for the fire-and-forget processWithAgentMode call now uses logApp for consistency with other error logging in the tipc module.

augmentcode bot reviewed Dec 23, 2025

View reviewed changes

Conversation

aj47 commented Oct 31, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

UI Components

Backend Implementation

Multimodal Message Format

Type Safety

Testing

Fixes

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

aj47 commented Oct 31, 2025

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

aj47 commented Nov 1, 2025

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

aj47 commented Dec 20, 2025

Uh oh!

augmentcode bot commented Dec 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

aj47 commented Oct 31, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 31, 2025 •

edited

Loading

augmentcode bot commented Dec 20, 2025 •

edited

Loading