@abrichr abrichr commented Apr 1, 2025

This PR introduces the first working end-to-end implementation of the OmniMCP agent, successfully completing a multi-step UI automation task: "Open calculator and compute 5 * 9".

Summary of Changes:

  • Integrated Core Components:
    • VisualState with OmniParserClient for screen perception.
    • core.plan_action_for_ui with Claude Sonnet for action planning.
    • InputController using pynput for robust, cross-platform action execution.
  • End-to-End Execution: The agent now successfully performs the calculator task by:
    • Opening Spotlight/OS Search (Cmd+Space).
    • Typing the application name ("Calculator").
    • Pressing Enter to launch the app.
    • Typing the calculation ("5 * 9").
    • Pressing Enter to get the result.
    • Recognizing goal completion via the LLM.
  • Key Fixes:
    • Resolved coordinate space mismatch for mouse clicks on macOS Retina displays using get_scaling_factor.
    • Implemented reliable key combination execution (e.g., Cmd+Space) via InputController.execute_key_string.
    • Fixed various TypeError, NameError, and Pydantic validation issues encountered during development.
    • Improved LLM prompt in core.py for better multi-step planning and added OS platform context.
  • Enhanced Debugging:
    • Demo runs now save outputs to unique timestamped directories under images/.
    • Each step saves: raw screenshot (_state_raw.png), screenshot with parsed bounding boxes (_state_parsed.png), and action highlight/annotation (_action_highlight.png).
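The Retina coordinate fix above can be sketched as follows. This is an illustrative stand-in, not the actual `get_scaling_factor` implementation: it assumes OmniParser returns coordinates normalized against the physical (pixel) screenshot, while pynput expects logical (point) coordinates, so physical coordinates are divided by the display scaling factor (2.0 on most Retina displays).

```python
# Hypothetical sketch of the Retina coordinate-space fix: the screenshot is in
# physical pixels, mouse input is in logical points, so divide by the scaling
# factor (obtained in the PR via AppKit's backingScaleFactor).

def to_logical_coords(norm_x: float, norm_y: float,
                      px_width: int, px_height: int,
                      scaling_factor: float) -> tuple[int, int]:
    """Map normalized parser coords to logical screen coords for the mouse."""
    physical_x = norm_x * px_width
    physical_y = norm_y * px_height
    return int(physical_x / scaling_factor), int(physical_y / scaling_factor)

# Example: element center at (0.5, 0.25) on a 2880x1800 physical screenshot
print(to_logical_coords(0.5, 0.25, 2880, 1800, 2.0))  # -> (720, 225)
```

Without the division, clicks on a 2x Retina display land at twice the intended distance from the origin, which matches the mismatch described above.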

How to Test:

  1. Ensure required libraries (pynput, pyobjc-framework-Cocoa on macOS, etc.) are installed.
  2. Ensure necessary environment variables (ANTHROPIC_API_KEY, potentially AWS creds for OmniParser deployment) are set.
  3. Run the demo from the project root directory:
    python demo.py ["Optional natural language goal"]
    Example: python demo.py "Open calculator and compute 5 * 9"
    (Wait for the steps to execute)

Demo:

Demo GIF demonstrating the calculator task will be added here before merging.

Known Issues / Next Steps:

  • Performance: VisualState.update() latency is high (~15s+ per step) and needs investigation (OmniParser server vs. network vs. client logic).
  • Visualization Accuracy: The accuracy of bounding boxes in saved images (parsed state, action highlight) depends on OmniParser's output and may need refinement for specific UI elements (like Spotlight).
  • Robustness:
    • LLM planning can still be brittle; further prompt engineering or alternative planning strategies may be needed for more complex tasks.
    • The strategy of truncating the UI element list passed to the LLM needs improvement.
  • Goal Completion: the LLM sometimes outputs minor superfluous actions even when is_goal_complete is true (though validation now handles this).
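One direction for the element-list truncation issue noted above: rank elements before cutting the list, rather than chopping at a fixed index. The sketch below is hypothetical; the `UIElement` fields shown are assumptions for illustration, not the project's actual model.

```python
# Hedged sketch: prefer high-confidence elements when the UI element list
# must be truncated before being sent to the LLM. Field names are assumed.
from dataclasses import dataclass

@dataclass
class UIElement:
    id: int
    content: str
    confidence: float

def select_elements_for_prompt(elements: list[UIElement],
                               max_elements: int = 50) -> list[UIElement]:
    """Keep the highest-confidence elements instead of the first N."""
    ranked = sorted(elements, key=lambda e: e.confidence, reverse=True)
    return ranked[:max_elements]

elements = [UIElement(i, f"btn{i}", c) for i, c in enumerate([0.2, 0.9, 0.5])]
print([e.id for e in select_elements_for_prompt(elements, max_elements=2)])  # -> [1, 2]
```

A more sophisticated version might also weight by proximity to the last action or by element type, per the "needs improvement" note above.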

abrichr added 12 commits March 30, 2025 22:26
- Refactors `demo.py` and `test_deploy_and_parse.py` to use `VisualState`.
- `VisualState` now handles screenshotting, calling the deployed OmniParser server via `OmniParserClient`, and mapping the JSON response to `UIElement` objects.
- Adds command-line argument parsing for `user_goal` to `demo.py`.
- Includes fixes to `server.py` for robust deployment and alarm-based auto-shutdown.
- Verified end-to-end perception pipeline (screenshot->parse->map) successfully returns structured elements.

Note: `demo.py` still uses simulation for state transitions after planning. E2E tests remain skipped/commented out.
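The screenshot->parse->map step described in this commit can be sketched roughly as below. The JSON shape and field names are assumptions for illustration, not the actual OmniParser response schema.

```python
# Illustrative sketch of mapping a parser JSON payload to structured element
# records, as VisualState does per the commit message. Schema is assumed.

def map_parser_response(response: dict) -> list[dict]:
    """Map a parser JSON payload into structured element dicts."""
    elements = []
    for i, item in enumerate(response.get("parsed_content_list", [])):
        elements.append({
            "id": i,
            "type": item.get("type", "unknown"),
            "content": item.get("content", ""),
            "bounds": item.get("bbox"),  # e.g. normalized [x1, y1, x2, y2]
        })
    return elements

raw = {"parsed_content_list": [{"type": "text", "content": "Calculator",
                               "bbox": [0.4, 0.1, 0.6, 0.15]}]}
print(map_parser_response(raw)[0]["content"])  # -> Calculator
```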
- Fixes Lambda permission issue allowing CloudWatch Alarms to trigger stop.
- Adds waiter after Lambda code update to prevent ResourceConflictException.
- Implements robust instance state handling in deploy_ec2_instance (ignores shutting-down/terminated).
- Adds --restart always policy to docker run command.
- Ensures Deploy.start returns IP/ID for client initialization.
- Includes previous fixes for gpg tty error and Lambda AWS_REGION env var.
- Deployment now successfully completes end-to-end.
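The "waiter after Lambda code update" fix works because Lambda rejects concurrent updates with ResourceConflictException while LastUpdateStatus is still "InProgress". A minimal poll loop illustrating the idea is below, written against an injected status function so it runs without AWS; with boto3 the same effect comes from `client.get_waiter("function_updated").wait(FunctionName=name)`.

```python
# Sketch of waiting out a Lambda update before issuing the next one.
# `get_status` stands in for reading LastUpdateStatus from
# get_function_configuration(); the real code presumably uses a boto3 waiter.
import time

def wait_for_update(get_status, timeout: float = 60.0,
                    interval: float = 0.01) -> str:
    """Poll `get_status()` until it leaves 'InProgress' or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status != "InProgress":
            return status
        time.sleep(interval)
    raise TimeoutError("Lambda update did not settle in time")

statuses = iter(["InProgress", "InProgress", "Successful"])
print(wait_for_update(lambda: next(statuses)))  # -> Successful
```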
- Adds format_chat_messages utility for readable prompt logging.
- Implements DEBUG level logging for full LLM prompt messages and response JSON in plan_action_for_ui (core.py).
- Configures loguru to output DEBUG logs to a timestamped file in logs/.
- Essential for debugging LLM planning behaviour.
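The timestamped DEBUG-log setup might look like the following. This sketch uses the stdlib `logging` module as a stand-in for loguru (the commit configures loguru similarly via `logger.add`); directory and filename pattern are assumptions.

```python
# Hedged sketch: write DEBUG-level logs to a timestamped file under logs/,
# mirroring the loguru configuration described in the commit message.
import logging
from datetime import datetime
from pathlib import Path

def setup_debug_logging(log_dir: str = "logs") -> Path:
    """Attach a DEBUG file handler writing to logs/run_YYYYMMDD_HHMMSS.log."""
    Path(log_dir).mkdir(exist_ok=True)
    log_file = Path(log_dir) / f"run_{datetime.now():%Y%m%d_%H%M%S}.log"
    handler = logging.FileHandler(log_file)
    handler.setLevel(logging.DEBUG)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger("omnimcp_demo")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return log_file

path = setup_debug_logging()
logging.getLogger("omnimcp_demo").debug("full LLM prompt would be logged here")
print(path.suffix)  # -> .log
```

With loguru itself this collapses to roughly `logger.add("logs/run_{time}.log", level="DEBUG")`.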
…ture

- Refactors `omnimcp.omnimcp.VisualState` and `OmniMCP` classes to use `OmniParserClient` instead of the defunct `OmniParserProvider`.
- Integrates OmniParser response mapping logic directly into `VisualState`.
- Fixes logic errors in placeholder `VisualState.find_element` and `OmniMCP._verify_action` (numpy bool).
- Updates mocking and assertions in `tests/test_omnimcp_core.py` to align with refactored classes; these tests now pass.
- Consolidates test helpers into `omnimcp.testing_utils.py`.
- Cleans up test directory structure: Moves all tests to root `tests/`, removes `omnimcp/tests/`, renames/removes helper/duplicate test files.
- Moves CI testing strategy document to `docs/testing_strategy.md`.

Non-e2e tests now pass. E2E tests remain skipped/commented out (tracked separately).
Successfully executes the "Open calculator and compute 5*9" goal end-to-end by integrating perception, planning, and action execution.

Key Changes:
- Integrated VisualState (using OmniParser client) for screen perception.
- Integrated LLM planner (using core.py with Claude Sonnet) for generating actions based on UI elements, goal, and history.
- Implemented InputController (omnimcp/input.py using pynput) for robust mouse/keyboard control.
- Added parsing logic in InputController to handle LLM key strings (e.g., "Cmd+Space", "Enter", "shift+a").
- Fixed coordinate space mismatch for mouse actions on macOS Retina displays using AppKit's backingScaleFactor via get_scaling_factor().
- Refined LLM prompt in core.py for better multi-step app launch planning and added OS platform context.
- Added timestamped output directories (`images/YYYYMMDD_HHMMSS/`) for demo runs.
- Added saving of multiple debug images per step: raw state (`_state_raw.png`), parsed state with bounding boxes (`_state_parsed.png`), and action highlight/annotation (`_action_highlight.png`).
- Resolved various TypeErrors, NameErrors, and Pydantic validation errors encountered during development, including handling of platform-specific keys and goal completion LLM output.
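The key-string parsing mentioned above (handling LLM output like "Cmd+Space", "Enter", "shift+a") can be sketched as a tokenizer that separates modifiers from the main key. This is an illustrative stand-in for `InputController.execute_key_string`, with the mapping to actual pynput Key objects stubbed out; token names are assumptions.

```python
# Hedged sketch: split an LLM-provided key string into modifier tokens and a
# single main key, the shape of input pynput's Controller ultimately needs.

MODIFIERS = {"cmd", "ctrl", "alt", "shift"}

def parse_key_string(key_string: str) -> tuple[list[str], str]:
    """Split e.g. 'Cmd+Space' into (['cmd'], 'space')."""
    tokens = [t.strip().lower() for t in key_string.split("+") if t.strip()]
    mods = [t for t in tokens if t in MODIFIERS]
    keys = [t for t in tokens if t not in MODIFIERS]
    if len(keys) != 1:
        raise ValueError(f"expected exactly one non-modifier key in {key_string!r}")
    return mods, keys[0]

print(parse_key_string("Cmd+Space"))  # -> (['cmd'], 'space')
print(parse_key_string("shift+a"))    # -> (['shift'], 'a')
print(parse_key_string("Enter"))      # -> ([], 'enter')
```

Execution would then press the modifiers, tap the main key, and release the modifiers in reverse order.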

Known Issues / TODO:
- High latency (~15s+) in VisualState.update() due to OmniParser/network requires investigation.
- Accuracy/consistency of OmniParser's bounding boxes needs review (e.g., for Spotlight elements).
- Action highlight visualization (`draw_action_highlight`) accuracy depends on OmniParser bounds.
- Truncating the element list sent to the LLM needs a more sophisticated approach.
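For reference, a `draw_action_highlight`-style helper might look like the sketch below, using Pillow to draw the target element's bounding box on a copy of the screenshot. Function signature and the normalized box format are assumptions based on the notes above, not the project's actual code.

```python
# Hypothetical sketch of highlighting a planned action's target element by
# drawing its normalized bounding box onto a copy of the screenshot.
from PIL import Image, ImageDraw

def draw_action_highlight(img: Image.Image, bounds: tuple,
                          color: str = "red", width: int = 4) -> Image.Image:
    """Return a copy of `img` with normalized (x1, y1, x2, y2) bounds drawn."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    x1, y1, x2, y2 = bounds
    draw.rectangle([x1 * w, y1 * h, x2 * w, y2 * h], outline=color, width=width)
    return out

img = Image.new("RGB", (200, 100), "white")
highlighted = draw_action_highlight(img, (0.25, 0.25, 0.75, 0.75))
print(highlighted.size)  # -> (200, 100)
```

As the known issues note, the usefulness of such a highlight is bounded by the accuracy of the parser's boxes.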
@abrichr abrichr merged commit 35a2bdc into main Apr 1, 2025
1 check passed
@abrichr abrichr deleted the feat/real-action-loop branch April 1, 2025 18:54