@abrichr abrichr commented Apr 1, 2025

This PR introduces the first working end-to-end implementation of the OmniMCP agent, successfully completing a multi-step UI automation task: "Open calculator and compute 5 * 9".

Summary of Changes:

  • Integrated Core Components:
    • VisualState with OmniParserClient for screen perception.
    • core.plan_action_for_ui with Claude Sonnet for action planning.
    • InputController using pynput for robust, cross-platform action execution.
  • End-to-End Execution: The agent now successfully performs the calculator task by:
    • Opening Spotlight/OS Search (Cmd+Space).
    • Typing the application name ("Calculator").
    • Pressing Enter to launch the app.
    • Typing the calculation ("5 * 9").
    • Pressing Enter to get the result.
    • Recognizing goal completion via the LLM.
  • Key Fixes:
    • Resolved coordinate space mismatch for mouse clicks on macOS Retina displays using get_scaling_factor.
    • Implemented reliable key combination execution (e.g., Cmd+Space) via InputController.execute_key_string.
    • Fixed various TypeError, NameError, and Pydantic validation issues encountered during development.
    • Improved LLM prompt in core.py for better multi-step planning and added OS platform context.
  • Enhanced Debugging:
    • Demo runs now save outputs to unique timestamped directories under images/.
    • Each step saves: raw screenshot (_state_raw.png), screenshot with parsed bounding boxes (_state_parsed.png), and action highlight/annotation (_action_highlight.png).
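The Retina coordinate fix above can be sketched as follows. This is an illustrative stand-in, not the actual `get_scaling_factor` implementation: it assumes OmniParser returns coordinates normalized against the physical (pixel) screenshot, while pynput expects logical (point) coordinates, so physical coordinates are divided by the display scaling factor (2.0 on most Retina displays).

```python
# Hypothetical sketch of the Retina coordinate-space fix: the screenshot is in
# physical pixels, mouse input is in logical points, so divide by the scaling
# factor (obtained in the PR via AppKit's backingScaleFactor).

def to_logical_coords(norm_x: float, norm_y: float,
                      px_width: int, px_height: int,
                      scaling_factor: float) -> tuple[int, int]:
    """Map normalized parser coords to logical screen coords for the mouse."""
    physical_x = norm_x * px_width
    physical_y = norm_y * px_height
    return int(physical_x / scaling_factor), int(physical_y / scaling_factor)

# Example: element center at (0.5, 0.25) on a 2880x1800 physical screenshot
print(to_logical_coords(0.5, 0.25, 2880, 1800, 2.0))  # -> (720, 225)
```

Without the division, clicks on a 2x Retina display land at twice the intended distance from the origin, which matches the mismatch described above.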

How to Test:

  1. Ensure required libraries (pynput, pyobjc-framework-Cocoa on macOS, etc.) are installed.
  2. Ensure necessary environment variables (ANTHROPIC_API_KEY, potentially AWS creds for OmniParser deployment) are set.
  3. Run the demo from the project root directory:
    python demo.py ["Optional natural language goal"]
    Example: python demo.py "Open calculator and compute 5 * 9"
    (Wait for the steps to execute)

Demo:

Demo GIF demonstrating the calculator task will be added here before merging.

Known Issues / Next Steps:

  • Performance: VisualState.update() latency is high (~15s+ per step) and needs investigation (OmniParser server vs. network vs. client logic).
  • Visualization Accuracy: The accuracy of bounding boxes in saved images (parsed state, action highlight) depends on OmniParser's output and may need refinement for specific UI elements (like Spotlight).
  • Robustness:
    • LLM planning can still be brittle; further prompt engineering or alternative planning strategies may be needed for more complex tasks.
    • The strategy of truncating the UI element list passed to the LLM needs improvement.
  • Goal Completion: the LLM sometimes outputs minor superfluous actions even when is_goal_complete is true (though validation now handles this).
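One direction for the element-list truncation issue noted above: rank elements before cutting the list, rather than chopping at a fixed index. The sketch below is hypothetical; the `UIElement` fields shown are assumptions for illustration, not the project's actual model.

```python
# Hedged sketch: prefer high-confidence elements when the UI element list
# must be truncated before being sent to the LLM. Field names are assumed.
from dataclasses import dataclass

@dataclass
class UIElement:
    id: int
    content: str
    confidence: float

def select_elements_for_prompt(elements: list[UIElement],
                               max_elements: int = 50) -> list[UIElement]:
    """Keep the highest-confidence elements instead of the first N."""
    ranked = sorted(elements, key=lambda e: e.confidence, reverse=True)
    return ranked[:max_elements]

elements = [UIElement(i, f"btn{i}", c) for i, c in enumerate([0.2, 0.9, 0.5])]
print([e.id for e in select_elements_for_prompt(elements, max_elements=2)])  # -> [1, 2]
```

A more sophisticated version might also weight by proximity to the last action or by element type, per the "needs improvement" note above.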

abrichr added 12 commits March 30, 2025 22:26
- Refactors `demo.py` and `test_deploy_and_parse.py` to use `VisualState`.
- `VisualState` now handles screenshotting, calling the deployed OmniParser server via `OmniParserClient`, and mapping the JSON response to `UIElement` objects.
- Adds command-line argument parsing for `user_goal` to `demo.py`.
- Includes fixes to `server.py` for robust deployment and alarm-based auto-shutdown.
- Verified end-to-end perception pipeline (screenshot->parse->map) successfully returns structured elements.

Note: `demo.py` still uses simulation for state transitions after planning. E2E tests remain skipped/commented out.
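The screenshot->parse->map step described in this commit can be sketched roughly as below. The JSON shape and field names are assumptions for illustration, not the actual OmniParser response schema.

```python
# Illustrative sketch of mapping a parser JSON payload to structured element
# records, as VisualState does per the commit message. Schema is assumed.

def map_parser_response(response: dict) -> list[dict]:
    """Map a parser JSON payload into structured element dicts."""
    elements = []
    for i, item in enumerate(response.get("parsed_content_list", [])):
        elements.append({
            "id": i,
            "type": item.get("type", "unknown"),
            "content": item.get("content", ""),
            "bounds": item.get("bbox"),  # e.g. normalized [x1, y1, x2, y2]
        })
    return elements

raw = {"parsed_content_list": [{"type": "text", "content": "Calculator",
                               "bbox": [0.4, 0.1, 0.6, 0.15]}]}
print(map_parser_response(raw)[0]["content"])  # -> Calculator
```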
- Fixes Lambda permission issue allowing CloudWatch Alarms to trigger stop.
- Adds waiter after Lambda code update to prevent ResourceConflictException.
- Implements robust instance state handling in deploy_ec2_instance (ignores shutting-down/terminated).
- Adds --restart always policy to docker run command.
- Ensures Deploy.start returns IP/ID for client initialization.
- Includes previous fixes for gpg tty error and Lambda AWS_REGION env var.
- Deployment now successfully completes end-to-end.
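The "waiter after Lambda code update" fix works because Lambda rejects concurrent updates with ResourceConflictException while LastUpdateStatus is still "InProgress". A minimal poll loop illustrating the idea is below, written against an injected status function so it runs without AWS; with boto3 the same effect comes from `client.get_waiter("function_updated").wait(FunctionName=name)`.

```python
# Sketch of waiting out a Lambda update before issuing the next one.
# `get_status` stands in for reading LastUpdateStatus from
# get_function_configuration(); the real code presumably uses a boto3 waiter.
import time

def wait_for_update(get_status, timeout: float = 60.0,
                    interval: float = 0.01) -> str:
    """Poll `get_status()` until it leaves 'InProgress' or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status != "InProgress":
            return status
        time.sleep(interval)
    raise TimeoutError("Lambda update did not settle in time")

statuses = iter(["InProgress", "InProgress", "Successful"])
print(wait_for_update(lambda: next(statuses)))  # -> Successful
```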
- Adds format_chat_messages utility for readable prompt logging.
- Implements DEBUG level logging for full LLM prompt messages and response JSON in plan_action_for_ui (core.py).
- Configures loguru to output DEBUG logs to a timestamped file in logs/.
- Essential for debugging LLM planning behaviour.
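The timestamped DEBUG-log setup might look like the following. This sketch uses the stdlib `logging` module as a stand-in for loguru (the commit configures loguru similarly via `logger.add`); directory and filename pattern are assumptions.

```python
# Hedged sketch: write DEBUG-level logs to a timestamped file under logs/,
# mirroring the loguru configuration described in the commit message.
import logging
from datetime import datetime
from pathlib import Path

def setup_debug_logging(log_dir: str = "logs") -> Path:
    """Attach a DEBUG file handler writing to logs/run_YYYYMMDD_HHMMSS.log."""
    Path(log_dir).mkdir(exist_ok=True)
    log_file = Path(log_dir) / f"run_{datetime.now():%Y%m%d_%H%M%S}.log"
    handler = logging.FileHandler(log_file)
    handler.setLevel(logging.DEBUG)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger("omnimcp_demo")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return log_file

path = setup_debug_logging()
logging.getLogger("omnimcp_demo").debug("full LLM prompt would be logged here")
print(path.suffix)  # -> .log
```

With loguru itself this collapses to roughly `logger.add("logs/run_{time}.log", level="DEBUG")`.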
…ture

- Refactors `omnimcp.omnimcp.VisualState` and `OmniMCP` classes to use `OmniParserClient` instead of the defunct `OmniParserProvider`.
- Integrates OmniParser response mapping logic directly into `VisualState`.
- Fixes logic errors in placeholder `VisualState.find_element` and `OmniMCP._verify_action` (numpy bool).
- Updates mocking and assertions in `tests/test_omnimcp_core.py` to align with refactored classes; these tests now pass.
- Consolidates test helpers into `omnimcp.testing_utils.py`.
- Cleans up test directory structure: Moves all tests to root `tests/`, removes `omnimcp/tests/`, renames/removes helper/duplicate test files.
- Moves CI testing strategy document to `docs/testing_strategy.md`.

Non-e2e tests now pass. E2E tests remain skipped/commented out (tracked separately).
Successfully executes the "Open calculator and compute 5*9" goal end-to-end by integrating perception, planning, and action execution.

Key Changes:
- Integrated VisualState (using OmniParser client) for screen perception.
- Integrated LLM planner (using core.py with Claude Sonnet) for generating actions based on UI elements, goal, and history.
- Implemented InputController (omnimcp/input.py using pynput) for robust mouse/keyboard control.
- Added parsing logic in InputController to handle LLM key strings (e.g., "Cmd+Space", "Enter", "shift+a").
- Fixed coordinate space mismatch for mouse actions on macOS Retina displays using AppKit's backingScaleFactor via get_scaling_factor().
- Refined LLM prompt in core.py for better multi-step app launch planning and added OS platform context.
- Added timestamped output directories (`images/YYYYMMDD_HHMMSS/`) for demo runs.
- Added saving of multiple debug images per step: raw state (`_state_raw.png`), parsed state with bounding boxes (`_state_parsed.png`), and action highlight/annotation (`_action_highlight.png`).
- Resolved various TypeErrors, NameErrors, and Pydantic validation errors encountered during development, including handling of platform-specific keys and goal completion LLM output.
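The key-string parsing mentioned above (handling LLM output like "Cmd+Space", "Enter", "shift+a") can be sketched as a tokenizer that separates modifiers from the main key. This is an illustrative stand-in for `InputController.execute_key_string`, with the mapping to actual pynput Key objects stubbed out; token names are assumptions.

```python
# Hedged sketch: split an LLM-provided key string into modifier tokens and a
# single main key, the shape of input pynput's Controller ultimately needs.

MODIFIERS = {"cmd", "ctrl", "alt", "shift"}

def parse_key_string(key_string: str) -> tuple[list[str], str]:
    """Split e.g. 'Cmd+Space' into (['cmd'], 'space')."""
    tokens = [t.strip().lower() for t in key_string.split("+") if t.strip()]
    mods = [t for t in tokens if t in MODIFIERS]
    keys = [t for t in tokens if t not in MODIFIERS]
    if len(keys) != 1:
        raise ValueError(f"expected exactly one non-modifier key in {key_string!r}")
    return mods, keys[0]

print(parse_key_string("Cmd+Space"))  # -> (['cmd'], 'space')
print(parse_key_string("shift+a"))    # -> (['shift'], 'a')
print(parse_key_string("Enter"))      # -> ([], 'enter')
```

Execution would then press the modifiers, tap the main key, and release the modifiers in reverse order.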

Known Issues / TODO:
- High latency (~15s+) in VisualState.update() due to OmniParser/network requires investigation.
- Accuracy/consistency of OmniParser's bounding boxes needs review (e.g., for Spotlight elements).
- Action highlight visualization (`draw_action_highlight`) accuracy depends on OmniParser bounds.
- Truncating the element list sent to the LLM needs a more sophisticated approach.
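For reference, a `draw_action_highlight`-style helper might look like the sketch below, using Pillow to draw the target element's bounding box on a copy of the screenshot. Function signature and the normalized box format are assumptions based on the notes above, not the project's actual code.

```python
# Hypothetical sketch of highlighting a planned action's target element by
# drawing its normalized bounding box onto a copy of the screenshot.
from PIL import Image, ImageDraw

def draw_action_highlight(img: Image.Image, bounds: tuple,
                          color: str = "red", width: int = 4) -> Image.Image:
    """Return a copy of `img` with normalized (x1, y1, x2, y2) bounds drawn."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    x1, y1, x2, y2 = bounds
    draw.rectangle([x1 * w, y1 * h, x2 * w, y2 * h], outline=color, width=width)
    return out

img = Image.new("RGB", (200, 100), "white")
highlighted = draw_action_highlight(img, (0.25, 0.25, 0.75, 0.75))
print(highlighted.size)  # -> (200, 100)
```

As the known issues note, the usefulness of such a highlight is bounded by the accuracy of the parser's boxes.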
@abrichr abrichr merged commit 35a2bdc into main Apr 1, 2025
1 check passed
@abrichr abrichr deleted the feat/real-action-loop branch April 1, 2025 18:54