feat(agent): Implement working multi-step calculator demo loop #20
Merged
- Refactors `demo.py` and `test_deploy_and_parse.py` to use `VisualState`.
- `VisualState` now handles screenshotting, calling the deployed OmniParser server via `OmniParserClient`, and mapping the JSON response to `UIElement` objects.
- Adds command-line argument parsing for `user_goal` to `demo.py`.
- Includes fixes to `server.py` for robust deployment and alarm-based auto-shutdown.
- Verified that the end-to-end perception pipeline (screenshot -> parse -> map) successfully returns structured elements.

Note: `demo.py` still uses simulation for state transitions after planning. E2E tests remain skipped/commented out.

- Fixes Lambda permission issue, allowing CloudWatch Alarms to trigger stop.
- Adds waiter after Lambda code update to prevent `ResourceConflictException`.
- Implements robust instance state handling in `deploy_ec2_instance` (ignores shutting-down/terminated).
- Adds `--restart always` policy to the `docker run` command.
- Ensures `Deploy.start` returns IP/ID for client initialization.
- Includes previous fixes for the gpg tty error and the Lambda `AWS_REGION` env var.
- Deployment now successfully completes end-to-end.

- Adds `format_chat_messages` utility for readable prompt logging.
- Implements DEBUG level logging for full LLM prompt messages and response JSON in `plan_action_for_ui` (`core.py`).
- Configures loguru to output DEBUG logs to a timestamped file in `logs/`.
- Essential for debugging LLM planning behaviour.

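A prompt formatter of the kind this commit describes could look roughly like the sketch below. It is illustrative only: the real `format_chat_messages` signature and output layout are not shown in this PR, so the message shape (dicts with `role` and `content` keys) and the separator style are assumptions.

```python
from typing import Any, Dict, List


def format_chat_messages(messages: List[Dict[str, Any]]) -> str:
    """Render chat messages as a readable multi-line string for log files.

    Assumes each message is a dict with "role" and "content" keys; the
    utility added in this PR may use a different shape.
    """
    blocks = []
    for index, message in enumerate(messages):
        role = str(message.get("role", "unknown")).upper()
        content = message.get("content", "")
        if not isinstance(content, str):
            # Structured content (lists, dicts) is logged via repr().
            content = repr(content)
        blocks.append(f"--- [{index}] {role} ---\n{content}")
    return "\n".join(blocks)
```

The resulting string can be passed to any logger at DEBUG level, which keeps long prompts out of the normal console output.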
…ture

- Refactors `omnimcp.omnimcp.VisualState` and `OmniMCP` classes to use `OmniParserClient` instead of the defunct `OmniParserProvider`.
- Integrates OmniParser response mapping logic directly into `VisualState`.
- Fixes logic errors in placeholder `VisualState.find_element` and `OmniMCP._verify_action` (numpy bool).
- Updates mocking and assertions in `tests/test_omnimcp_core.py` to align with refactored classes; these tests now pass.
- Consolidates test helpers into `omnimcp.testing_utils.py`.
- Cleans up test directory structure: moves all tests to root `tests/`, removes `omnimcp/tests/`, renames/removes helper/duplicate test files.
- Moves CI testing strategy document to `docs/testing_strategy.md`.

Non-e2e tests now pass. E2E tests remain skipped/commented out (tracked separately).

Successfully executes the "Open calculator and compute 5*9" goal end-to-end by integrating perception, planning, and action execution.

Key Changes:
- Integrated `VisualState` (using OmniParser client) for screen perception.
- Integrated LLM planner (using `core.py` with Claude Sonnet) for generating actions based on UI elements, goal, and history.
- Implemented `InputController` (`omnimcp/input.py` using pynput) for robust mouse/keyboard control.
- Added parsing logic in `InputController` to handle LLM key strings (e.g., "Cmd+Space", "Enter", "shift+a").
- Fixed coordinate space mismatch for mouse actions on macOS Retina displays using AppKit's `backingScaleFactor` via `get_scaling_factor()`.
- Refined LLM prompt in `core.py` for better multi-step app launch planning and added OS platform context.
- Added timestamped output directories (`images/YYYYMMDD_HHMMSS/`) for demo runs.
- Added saving of multiple debug images per step: raw state (`_state_raw.png`), parsed state with bounding boxes (`_state_parsed.png`), and action highlight/annotation (`_action_highlight.png`).
- Resolved various TypeErrors, NameErrors, and Pydantic validation errors encountered during development, including handling of platform-specific keys and goal completion LLM output.

Known Issues / TODO:
- High latency (~15s+) in `VisualState.update()` due to OmniParser/network requires investigation.
- Accuracy/consistency of OmniParser's bounding boxes needs review (e.g., for Spotlight elements).
- Action highlight visualization (`draw_action_highlight`) accuracy depends on OmniParser bounds.
- Truncating the element list sent to the LLM needs a more sophisticated approach.
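The key-string handling mentioned above ("Cmd+Space", "Enter", "shift+a") can be illustrated with a small parser. This is a minimal sketch, not the code in `omnimcp/input.py`: the function name `parse_key_string` and the alias table are hypothetical, and it deliberately stops short of calling pynput so that it runs anywhere.

```python
from typing import List, Tuple

# Hypothetical alias table; the modifier names accepted by the real
# InputController are not shown in this PR.
MODIFIER_ALIASES = {
    "cmd": "cmd",
    "command": "cmd",
    "ctrl": "ctrl",
    "control": "ctrl",
    "alt": "alt",
    "option": "alt",
    "shift": "shift",
}


def parse_key_string(key_string: str) -> Tuple[List[str], str]:
    """Split a string such as "Cmd+Space" into (modifiers, main_key).

    Parts are lower-cased and the last part is treated as the main key.
    Limitation: a literal "+" key (e.g. "Ctrl++") is not supported.
    """
    parts = [p.strip().lower() for p in key_string.split("+") if p.strip()]
    if not parts:
        raise ValueError(f"empty key string: {key_string!r}")
    *modifier_parts, main_key = parts
    modifiers = []
    for part in modifier_parts:
        if part not in MODIFIER_ALIASES:
            raise ValueError(f"unknown modifier {part!r} in {key_string!r}")
        modifiers.append(MODIFIER_ALIASES[part])
    return modifiers, main_key
```

A controller would then press the modifiers in order, tap the main key, and release the modifiers in reverse order; rejecting unknown modifiers early avoids sending a half-formed shortcut to the OS.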
This PR introduces the first working end-to-end implementation of the OmniMCP agent, successfully completing a multi-step UI automation task: "Open calculator and compute 5 * 9".
Summary of Changes:
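- Integrated `VisualState` with `OmniParserClient` for screen perception.
- Integrated `core.plan_action_for_ui` with Claude Sonnet for action planning.
- Implemented `InputController` using `pynput` for robust, cross-platform action execution.
- […] `Cmd+Space`).
- Fixed the coordinate space mismatch for mouse actions on macOS Retina displays via `get_scaling_factor`.
- Added parsing of LLM key strings (e.g., `Cmd+Space`) via `InputController.execute_key_string`.
- Resolved `TypeError`, `NameError`, and Pydantic validation issues encountered during development.
- Refined the LLM prompt in `core.py` for better multi-step planning and added OS platform context.
- Added timestamped output directories under `images/`.
- Saves debug images per step: raw state (`_state_raw.png`), screenshot with parsed bounding boxes (`_state_parsed.png`), and action highlight/annotation (`_action_highlight.png`).
How to Test: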
- Ensure dependencies (`pynput`, `pyobjc-framework-Cocoa` on macOS, etc.) are installed.
- Ensure environment variables (`ANTHROPIC_API_KEY`, potentially AWS creds for OmniParser deployment) are set.
- Run `python demo.py ["Optional natural language goal"]`, for example `python demo.py "Open calculator and compute 5 * 9"`, and wait for the steps to execute.
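The optional goal argument in the command above could be handled with `argparse` along these lines. This is a sketch under assumptions: the PR only states that `demo.py` parses a `user_goal` argument, so the default goal and the help text here are illustrative.

```python
import argparse
from typing import List, Optional

# Assumed default; the actual default goal in demo.py is not shown in this PR.
DEFAULT_GOAL = "Open calculator and compute 5 * 9"


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse an optional positional natural-language goal."""
    parser = argparse.ArgumentParser(description="Run the OmniMCP demo loop.")
    parser.add_argument(
        "user_goal",
        nargs="?",  # zero or one positional argument
        default=DEFAULT_GOAL,
        help="natural language goal for the agent",
    )
    return parser.parse_args(argv)
```

Passing `argv` explicitly keeps the function testable: `parse_args([])` falls back to the default goal, while `parse_args(["Open TextEdit"])` returns the supplied one.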
Demo:
Demo GIF demonstrating the calculator task will be added here before merging.
Known Issues / Next Steps:
- `VisualState.update()` latency is high (~15s+ per step) and needs investigation (OmniParser server vs. network vs. client logic).
- […] `is_goal_complete` is true (though validation now handles this).
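One way to start the latency investigation is to time each stage of the update separately. The sketch below is a generic timing helper, not part of this PR: the `timed` context manager and the stage names are hypothetical, and the `time.sleep` calls are placeholders for the real screenshot, OmniParser request, and element-mapping steps.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Add the wall-clock duration of the enclosed block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)


def profile_update() -> Dict[str, float]:
    """Time three placeholder stages; swap the sleeps for the real calls."""
    timings: Dict[str, float] = {}
    with timed("screenshot", timings):
        time.sleep(0.01)  # placeholder for taking the screenshot
    with timed("parse_request", timings):
        time.sleep(0.02)  # placeholder for the OmniParser HTTP round trip
    with timed("map_elements", timings):
        time.sleep(0.01)  # placeholder for mapping JSON to UIElement objects
    return timings
```

Logging the returned dictionary per step would show whether the ~15s is dominated by the server round trip or by client-side work, which is the distinction the first item above asks about.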