diff --git a/README.md b/README.md
index 6f786d2..de9e59d 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,8 @@
# OmniMCP
[](https://github.com/OpenAdaptAI/OmniMCP/actions/workflows/ci.yml)
@@ -5,417 +10,166 @@
[](https://www.python.org/)
[](https://github.com/astral-sh/ruff)
-OmniMCP provides rich UI context and interaction capabilities to AI models through [Model Context Protocol (MCP)](https://github.com/modelcontextprotocol) and [microsoft/OmniParser](https://github.com/microsoft/OmniParser). It focuses on enabling deep understanding of user interfaces through visual analysis, structured responses, and precise interaction.
+OmniMCP provides rich UI context and interaction capabilities to AI models through [Model Context Protocol (MCP)](https://github.com/modelcontextprotocol) and [microsoft/OmniParser](https://github.com/microsoft/OmniParser). It focuses on enabling deep understanding of user interfaces through visual analysis, structured planning, and precise interaction execution.
## Core Features
-- **Rich Visual Context**: Deep understanding of UI elements via OmniParser.
-- **LLM-based Planning**: Uses LLMs (e.g., Claude) to plan actions based on visual state, goal, and history.
-- **Real Action Execution**: Controls mouse and keyboard via `pynput` for executing planned actions.
-- **Cross-Platform Compatibility**: Tested on macOS (including Retina scaling); core components designed for cross-platform use (CI runs on Linux).
-- **Structured Types**: Clean, typed responses using Pydantic and dataclasses.
-- **Automated Deployment**: On-demand deployment of OmniParser backend to AWS EC2 with auto-shutdown.
-- **Debugging Visualizations**: Generates timestamped images per step (raw state, parsed elements, action highlight).
+- **Visual Perception:** Understands UI elements using OmniParser.
+- **LLM Planning:** Plans next actions based on goal, history, and visual state.
+- **Agent Executor:** Orchestrates the perceive-plan-act loop (`omnimcp/agent_executor.py`).
+- **Action Execution:** Controls mouse/keyboard via `pynput` (`omnimcp/input.py`).
+- **CLI Interface:** Simple entry point (`cli.py`) for running tasks.
+- **Auto-Deployment:** Optional OmniParser server deployment to AWS EC2 with auto-shutdown.
+- **Debugging:** Generates timestamped visual logs per step.
## Overview
-The system works by capturing the screen, parsing it to understand UI elements, planning the next action with an LLM based on a goal, and executing that action using input controllers.
+`cli.py` uses `AgentExecutor` to run a perceive-plan-act loop. It captures the screen (`VisualState`), plans using an LLM (`core.plan_action_for_ui`), and executes actions (`InputController`).
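+
+A minimal wiring sketch of that loop (argument values are illustrative; `cli.py` is the authoritative version, including startup checks):
+
+```python
+# Sketch only: mirrors the component wiring in cli.py.
+from omnimcp.agent_executor import AgentExecutor
+from omnimcp.core import plan_action_for_ui
+from omnimcp.input import InputController
+from omnimcp.omniparser.client import OmniParserClient
+from omnimcp.omnimcp import VisualState
+
+client = OmniParserClient(server_url=None, auto_deploy=True)  # may deploy to EC2
+executor = AgentExecutor(
+    perception=VisualState(parser_client=client),  # screenshot -> UIElement list
+    planner=plan_action_for_ui,                    # LLM picks the next action
+    execution=InputController(),                   # pynput mouse/keyboard
+)
+executor.run(goal="Open calculator and compute 5 * 9", max_steps=10)
+```
+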
### Demos
-#### Real Action Demo (Calculator Task)
-
-This demonstrates the agent using real keyboard/mouse control to open Spotlight, search for Calculator, open it, and compute 5 * 9.
-
-
-*(GIF shows: Cmd+Space -> Type "Calculator" -> Enter -> Type "5 * 9" -> Enter. Final state shows Calculator with result 45.)*
-
-#### Synthetic UI Demo (Login Task)
-
-This demonstrates the planning loop working on generated UI images without real input control.
-
-
-*(GIF shows: Identifying username field, simulating typing; identifying password field, simulating typing; identifying login button, simulating click and transitioning to a final state.)*
-
-### Conceptual Flow
-
-
-Click to see conceptual flow diagrams
-
-
-
-
-
-1. **Spatial Feature Understanding**: OmniMCP begins by developing a deep understanding of the user interface's visual layout. Leveraging [microsoft/OmniParser](https://github.com/microsoft/OmniParser) (potentially deployed automatically to EC2), it performs detailed visual parsing, segmenting the screen and identifying all interactive and informational elements. This includes recognizing their types, content, spatial relationships, and attributes, creating a rich representation of the UI's static structure.
-
-
-
-
-
-
-
-2. **Temporal Feature Understanding**: To capture the dynamic aspects of the UI, OmniMCP tracks user interactions and the resulting state transitions. It records sequences of actions and changes within the UI, building a Process Graph that represents the flow of user workflows. This temporal understanding allows AI models to reason about interaction history and plan future actions based on context. (Note: Process Graph generation is a future goal).
-
-
-
-
-
-
-
-3. **Internal API Generation / Action Planning**: Utilizing the rich spatial and (optionally) temporal context it has acquired, OmniMCP leverages a Large Language Model (LLM) to plan the next action. Through In-Context Learning (prompting), the LLM dynamically determines the best action (e.g., click, type) and target element based on the current UI state, the user's goal, and the action history.
-
-
-
-
-
-
-
-4. **External API Publication (MCP)**: Optionally, OmniMCP can expose UI interaction capabilities through the [Model Context Protocol (MCP)](https://github.com/modelcontextprotocol). This provides a consistent interface for AI models (or other tools) to interact with the UI via standardized tools like `get_screen_state`, `click_element`, `type_text`, etc. (Note: MCP server implementation is currently experimental).
-
-
+- **Real Action (Calculator):** `python cli.py` opens Calculator and computes 5*9.
+ 
+- **Synthetic UI (Login):** `python demo_synthetic.py` uses generated images (no real I/O). *(Note: Pending refactor to use AgentExecutor).*
+ 
## Prerequisites
- Python >=3.10, <3.13
-- `uv` installed (`pip install uv` or see [Astral Docs](https://astral.sh/uv))
-- **macOS:** Requires `pyobjc-framework-Cocoa` (`uv pip install pyobjc-framework-Cocoa`) for correct coordinate scaling on Retina displays.
-- **Linux:** Requires an active graphical session (e.g., X server or Wayland with X compatibility) for `pynput` to function. CI uses workarounds (skipping GUI tests or `xvfb`).
+- `uv` installed (`pip install uv`)
+- **Linux Runtime Requirement:** Requires an active graphical session (X11/Wayland) for `pynput`; may need system libraries (`libx11-dev`, etc.) - see the `pynput` docs and the headless example below.
+
+*(macOS display scaling dependencies are handled automatically during installation).*
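+
+For headless Linux (e.g. CI), one workaround is a virtual display; a sketch assuming `xvfb` is installed:
+
+```bash
+# Run the test suite under a virtual X server (hypothetical CI setup)
+xvfb-run -a uv run pytest tests/
+```
+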
### For AWS Deployment Features
-The automated deployment of the OmniParser server (`omnimcp/omniparser/server.py`, triggered by `OmniParserClient` when no URL is provided) requires AWS credentials. These are loaded via `pydantic-settings` from a `.env` file in the project root or from environment variables. Ensure you have configured:
+Requires AWS credentials in `.env` (see `.env.example`). **Warning:** Creates AWS resources (EC2, Lambda, etc.) incurring costs. Use `python -m omnimcp.omniparser.server stop` to clean up.
```.env
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
-# AWS_REGION=us-east-1 # Optional, defaults work
-ANTHROPIC_API_KEY=YOUR_ANTHROPIC_KEY # Needed for LLM planning
-# OMNIPARSER_URL=http://... # Optional: Specify if NOT using auto-deploy
+ANTHROPIC_API_KEY=YOUR_ANTHROPIC_KEY
+# OMNIPARSER_URL=http://... # Optional: Skip auto-deploy
```
-**Warning:** Using the automated deployment will create and manage AWS resources (EC2 `g4dn.xlarge`, Lambda, CloudWatch Alarms, IAM Roles, Security Groups) in your account, which **will incur costs**. The system includes an auto-shutdown mechanism based on CPU inactivity (default ~60 minutes), but always remember to use `python -m omnimcp.omniparser.server stop` to clean up resources manually when finished to guarantee termination and avoid unexpected charges.
-
## Installation
-Currently, installation is from source only.
-
```bash
-# 1. Clone the repository
git clone https://github.com/OpenAdaptAI/OmniMCP.git
cd OmniMCP
-
-# 2. Setup environment and install dependencies
-# Ensure uv is installed (pip install uv)
-./install.sh # Creates .venv, activates, installs deps using uv
-
-# 3. Configure API Keys and AWS Credentials
+./install.sh # Creates .venv, installs deps incl. test extras
cp .env.example .env
-# Edit .env file to add your ANTHROPIC_API_KEY and AWS credentials
-
-# To activate the environment in the future:
-# source .venv/bin/activate # Linux/macOS
-# .venv\Scripts\activate.bat # Windows CMD
-# .venv\Scripts\Activate.ps1 # Windows PowerShell
-```
-*The `./install.sh` script creates a virtual environment using `uv`, activates it, and installs OmniMCP in editable mode along with test dependencies (`uv pip install -e ".[test]"`).*
-
-## Quick Start (Illustrative Example)
-
-**Note:** The `OmniMCP` high-level class and its associated MCP tools (`get_screen_state`, `click_element`, etc.) shown in this example (`omnimcp/omnimcp.py`) are currently experimental and require refactoring to fully align with the core components (like the refactored `InputController`). This example represents the intended future API design. For current functional examples, please see `demo.py` (real action demo) and `demo_synthetic.py` (synthetic UI loop).
-
-```python
-# Example of intended future usage via MCP server
-# from omnimcp import OmniMCP
-# from omnimcp.types import ScreenState # Assuming types are importable
-# import asyncio
-
-# async def main():
-# # Ensure .env file has ANTHROPIC_API_KEY and AWS keys (if using auto-deploy)
-# # OmniMCP might internally create OmniParserClient which handles deployment
-# mcp = OmniMCP() # May trigger deployment if OMNIPARSER_URL not set
-
-# # Get current UI state (would use real screenshot + OmniParser)
-# state: ScreenState = await mcp.get_screen_state()
-# print(f"Found {len(state.elements)} elements on screen.")
-
-# # Analyze specific element (would use LLM + visual state)
-# description = await mcp.describe_element(
-# "the main login button"
-# )
-# print(f"Description: {description}")
-
-# # Interact with UI (would use input controllers)
-# result = await mcp.click_element(
-# "Login button",
-# click_type="single"
-# )
-# if not result.success:
-# print(f"Click failed: {result.error}")
-# else:
-# print("Click successful (basic verification).")
-
-# asyncio.run(main())
-```
-
-## Running the Demos
-
-Ensure your virtual environment is activated (`source .venv/bin/activate` or similar) and `.env` file is configured.
-
-### Real Action Demo (`demo.py`)
-
-This demo uses the LLM planner and executes real mouse/keyboard actions to interact with your desktop UI.
-
-**Warning:** This script takes control of your mouse and keyboard! Close sensitive applications before running.
-
-```bash
-# Run with default goal (Calculator task)
-python demo.py
-
-# Run with a custom goal
-python demo.py "Your natural language goal here"
-
-# Check the images/YYYYMMDD_HHMMSS/ directory for step-by-step visuals
-```
-
-### Synthetic Demo (`demo_synthetic.py`)
-
-This runs the planning loop using generated images (no real UI interaction).
-
-```bash
-python demo_synthetic.py
-# Check the demo_output_multistep/ directory for generated images
+# Edit .env with your keys
+# Activate: source .venv/bin/activate (Linux/macOS) or relevant Windows command
```
-## Verifying Deployment & Parsing (Real Screenshot)
+## Quick Start
-This script tests the EC2 deployment and gets raw data from OmniParser for your current screen. Requires AWS credentials.
+Ensure environment is activated and `.env` is configured.
```bash
-# Ensure AWS credentials are in .env
-python -m tests.test_deploy_and_parse # Use -m to run as module
+# Run default goal (Calculator task)
+python cli.py
-# This will deploy an EC2 instance if needed (takes time!), take a screenshot,
-# send it for parsing, and print the raw JSON result.
+# Run custom goal
+python cli.py --goal "Your goal here"
-# Remember to stop the instance afterwards!
-python -m omnimcp.omniparser.server stop
+# See options
+python cli.py --help
```
+Debug outputs are saved in `runs/<timestamp>/`.
-## Core Types
-
-```python
-# omnimcp/types.py (Excerpts)
-from dataclasses import dataclass, field
-from typing import List, Optional, Dict, Any, Tuple, Literal
-from pydantic import BaseModel, Field # Assuming LLMActionPlan moved here
-
-# Define Bounds (assuming normalized coordinates 0.0-1.0)
-Bounds = Tuple[float, float, float, float] # (x, y, width, height)
-
-@dataclass
-class UIElement:
- """Represents a UI element with its properties."""
- id: int # Unique identifier for referencing
- type: str # button, text_field, checkbox, link, text, etc.
- content: str # Text content or accessibility label
- bounds: Bounds # Normalized coordinates (x, y, width, height)
- confidence: float = 1.0 # Detection confidence
- attributes: Dict[str, Any] = field(default_factory=dict) # e.g., {'checked': False}
-
-@dataclass
-class ScreenState:
- """Represents the current state of the screen with UI elements."""
- elements: List[UIElement]
- dimensions: Tuple[int, int] # Actual pixel dimensions
- timestamp: float
-
-@dataclass
-class ActionVerification:
- """Verification data for an action."""
- success: bool
- # before_state: bytes # Screenshot bytes (Optional)
- # after_state: bytes # Screenshot bytes (Optional)
- changes_detected: List[Bounds] # Regions where changes occurred
- confidence: float # Confidence score of verification
-
-@dataclass
-class InteractionResult:
- """Result of an interaction with the UI."""
- success: bool
- element: Optional[UIElement] = None
- error: Optional[str] = None
- context: Dict[str, Any] = field(default_factory=dict)
- verification: Optional[ActionVerification] = None
-
-# Example LLM Action Plan structure (defined in omnimcp/types.py)
-class LLMActionPlan(BaseModel):
- reasoning: str = Field(...)
- action: Literal["click", "type", "scroll", "press_key"] = Field(...)
- is_goal_complete: bool = Field(...)
- element_id: Optional[int] = Field(default=None)
- text_to_type: Optional[str] = Field(default=None)
- key_info: Optional[str] = Field(default=None)
- # ... includes validators ...
-```
-
-## MCP Implementation and Framework API
-
-**Note:** The `OmniMCP` class (`omnimcp/omnimcp.py`) providing an MCP server interface is currently experimental and less up-to-date than the core logic used in `demo.py`.
-
-### Target API (via MCP)
-
-```python
-async def get_screen_state() -> ScreenState:
- """Get current state of visible UI elements"""
-
-async def describe_element(description: str) -> str:
- """Get rich description of UI element"""
-
-async def find_elements(query: str, max_results: int = 5) -> List[UIElement]:
- """Find elements matching natural query"""
-
-async def click_element(description: str, click_type: Literal["single", "double", "right"] = "single") -> InteractionResult:
- """Click UI element matching description"""
-
-async def type_text(text: str, target: Optional[str] = None) -> TypeResult:
- """Type text, optionally clicking a target element first"""
-
-async def press_key(key_info: str) -> InteractionResult:
- """Press key or combination (e.g. "Enter", "Cmd+C")"""
-
-# ... other potential actions like scroll_view ...
-```
+**Note on MCP Server:** An experimental MCP server (`OmniMCP` class in `omnimcp/omnimcp.py`) exists but is separate from the primary `cli.py`/`AgentExecutor` workflow.
## Architecture
-1. **Visual State Manager** (`omnimcp/omnimcp.py` - `VisualState` class)
- * Takes screenshot.
- * Calls OmniParser Client.
- * Maps results to `UIElement` list.
-2. **OmniParser Client & Deploy** (`omnimcp/omniparser/`)
- * Manages communication with the OmniParser backend.
- * Handles automated deployment and auto-shutdown of OmniParser on EC2.
-3. **LLM Planner** (`omnimcp/core.py`)
- * Takes goal, history, platform, and current `UIElement` list.
- * Prompts LLM (e.g., Claude) to determine the next best action.
- * Parses structured JSON response (`LLMActionPlan`).
-4. **Input Controller** (`omnimcp/input.py`)
- * Wraps `pynput` for mouse clicks, keyboard typing/combos, scrolling. Handles coordinate scaling.
-5. **(Optional) MCP Server** (`omnimcp/omnimcp.py` - `OmniMCP` class)
- * Exposes functionality as MCP tools (Experimental).
+1. **CLI** (`cli.py`) - Entry point, setup, starts Executor.
+2. **Agent Executor** (`omnimcp/agent_executor.py`) - Orchestrates the loop, manages state/artifacts; components plug in via the interfaces sketched below.
+3. **Visual State Manager** (`omnimcp/omnimcp.py`) - Perception (screenshot, calls parser).
+4. **OmniParser Client & Deploy** (`omnimcp/omniparser/`) - Manages OmniParser server communication/deployment.
+5. **LLM Planner** (`omnimcp/core.py`) - Generates action plan.
+6. **Input Controller** (`omnimcp/input.py`) - Executes actions (mouse/keyboard).
+7. **(Optional) MCP Server** (`omnimcp/omnimcp.py`) - Experimental MCP interface.
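+
+Perception and execution are duck-typed via `typing.Protocol` (excerpt from `omnimcp/agent_executor.py`), so alternative backends such as a synthetic UI can be swapped in:
+
+```python
+from typing import List, Optional, Protocol, Tuple
+
+from PIL import Image
+
+from omnimcp.types import UIElement
+
+class PerceptionInterface(Protocol):
+    elements: List[UIElement]
+    screen_dimensions: Optional[Tuple[int, int]]
+    _last_screenshot: Optional[Image.Image]
+
+    def update(self) -> None: ...
+
+class ExecutionInterface(Protocol):
+    def click(self, x: int, y: int, click_type: str = "single") -> bool: ...
+    def type_text(self, text: str) -> bool: ...
+    def execute_key_string(self, key_info_str: str) -> bool: ...
+    def scroll(self, dx: int, dy: int) -> bool: ...
+```
+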
## Development
-### Environment Setup
+### Environment Setup & Checks
```bash
-# Clone repo and cd into it (see Installation)
-./install.sh # Creates .venv, activates, installs dependencies
-# Activate env if needed: source .venv/bin/activate or .venv\Scripts\activate
-```
-
-### Running Checks
-```bash
-# Activate environment: source .venv/bin/activate
-uv run ruff format .
-uv run ruff check . --fix
-
-# Run tests (skips GUI-dependent tests on headless Linux)
-uv run pytest tests/
+# Setup (if not done): ./install.sh
+# Activate env: source .venv/bin/activate (or similar)
+# Format/Lint: uv run ruff format . && uv run ruff check . --fix
+# Run tests: uv run pytest tests/
```
### Debug Support
+Running `python cli.py` saves timestamped runs in `runs/`, including:
+* `step_N_state_raw.png`
+* `step_N_state_parsed.png` (with element boxes)
+* `step_N_action_highlight.png` (with action highlight)
+* `final_state.png`
-The `demo.py` script automatically saves detailed visual state information for each step into timestamped directories under `images/`, including:
-* `step_N_state_raw.png`: The raw screenshot captured.
-* `step_N_state_parsed.png`: The screenshot with bounding boxes and IDs overlaid for all detected elements.
-* `step_N_action_highlight.png`: The screenshot dimmed, with the target element highlighted (if applicable) and the planned action annotated.
-
-*(Note: This section depends on the `OmniMCP` class refactor)*
-```python
-# Example usage assuming OmniMCP class is functional
-# from omnimcp import OmniMCP, DebugContext # Assuming DebugContext exists
-#
-# # Enable debug mode
-# mcp = OmniMCP(debug=True)
-#
-# # ... perform actions ...
-#
-# # Get debug context (example structure)
-# # debug_info: DebugContext = await mcp.get_debug_context()
-# # print(f"Last operation: {debug_info.tool_name}")
-# # print(f"Duration: {debug_info.duration}ms")
-```
-
-## Configuration
-
-OmniMCP uses a `.env` file in the project root for configuration, loaded via `omnimcp/config.py`. See `.env.example`.
-
-Key variables:
-```dotenv
-# Required for LLM planning
-ANTHROPIC_API_KEY=sk-ant-api03-...
-
-# Required for EC2 deployment features (if not using OMNIPARSER_URL)
-AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY
-AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_KEY
-AWS_REGION=us-east-1 # Or your preferred region
-
-# Optional: URL for a manually managed OmniParser server
-# OMNIPARSER_URL=http://:8000
+Detailed logs are written to `logs/run_YYYY-MM-DD_HH-mm-ss.log`; set `LOG_LEVEL=DEBUG` in `.env` for more detail.
-# Optional: EC2 Instance configuration (defaults provided)
-# AWS_EC2_INSTANCE_TYPE=g4dn.xlarge
-# INACTIVITY_TIMEOUT_MINUTES=60
-
-# Optional: Logging level
-# LOG_LEVEL=DEBUG
+
+**Example Log Snippet (Auto-Deploy + Agent Step):**
+
+```log
+# --- Initialization & Auto-Deploy ---
+2025-MM-DD HH:MM:SS | INFO | omnimcp.omniparser.client:... - No server_url provided, attempting discovery/deployment...
+2025-MM-DD HH:MM:SS | INFO | omnimcp.omniparser.server:... - Creating new EC2 instance...
+2025-MM-DD HH:MM:SS | SUCCESS | omnimcp.omniparser.server:... - Instance i-... is running. Public IP: ...
+2025-MM-DD HH:MM:SS | INFO | omnimcp.omniparser.server:... - Setting up auto-shutdown infrastructure...
+2025-MM-DD HH:MM:SS | SUCCESS | omnimcp.omniparser.server:... - Auto-shutdown infrastructure setup completed...
+... (SSH connection, Docker setup) ...
+2025-MM-DD HH:MM:SS | SUCCESS | omnimcp.omniparser.client:... - Auto-deployment successful. Server URL: http://...
+... (Agent Executor Init) ...
+
+# --- Agent Execution Loop Example Step ---
+2025-MM-DD HH:MM:SS | INFO | omnimcp.agent_executor:run:... - --- Step N/10 ---
+2025-MM-DD HH:MM:SS | DEBUG | omnimcp.agent_executor:run:... - Perceiving current screen state...
+2025-MM-DD HH:MM:SS | INFO | omnimcp.omnimcp:update:... - VisualState update complete. Found X elements. Took Y.YYs.
+2025-MM-DD HH:MM:SS | INFO | omnimcp.agent_executor:run:... - Perceived state with X elements.
+... (Save artifacts) ...
+2025-MM-DD HH:MM:SS | DEBUG | omnimcp.agent_executor:run:... - Planning next action...
+... (LLM Call) ...
+2025-MM-DD HH:MM:SS | INFO | omnimcp.agent_executor:run:... - LLM Plan: Action=..., TargetID=..., GoalComplete=False
+2025-MM-DD HH:MM:SS | DEBUG | omnimcp.agent_executor:run:... - Added to history: Step N: Planned action ...
+2025-MM-DD HH:MM:SS | INFO | omnimcp.agent_executor:run:... - Executing action: ...
+2025-MM-DD HH:MM:SS | SUCCESS | omnimcp.agent_executor:run:... - Action executed successfully.
+2025-MM-DD HH:MM:SS | DEBUG | omnimcp.agent_executor:run:... - Step N duration: Z.ZZs
+... (Loop continues or finishes) ...
```
+*(Note: Details like timings, counts, IPs, instance IDs, and specific plans will vary)*
+
-## Performance Considerations
-
-1. **Visual Analysis (OmniParser Latency)**: Currently the main bottleneck, with perception steps taking 10-20+ seconds via the remote EC2 instance. Optimization needed (local model, faster instance, parallelization, caching).
-2. **State Management**: Current approach re-parses the full screen each step. Future work includes diffing, caching, and background processing.
-3. **Element Targeting**: Basic search used; LLM-based or vector search planned.
-4. **LLM Latency**: API calls add several seconds per step.
+## Roadmap & Limitations
-## Limitations and Future Work
+Key limitations & future work areas:
-Current limitations include:
-- **Performance:** High latency in visual perception step needs significant optimization.
-- **Visual Parsing Accuracy:** OmniParser's ability to reliably detect certain elements (e.g., Spotlight input) needs more validation/improvement. Highlight visualization accuracy depends on this.
-- **LLM Robustness:** Planning can be brittle for complex goals or unexpected UI states; requires more sophisticated prompting or planning techniques.
-- **Element Context:** Truncating the element list sent to the LLM is suboptimal.
-- Core `OmniMCP` class / MCP server API is experimental.
-- E2E tests require refactoring and reliable CI setup (e.g., with xvfb).
+* **Performance:** Reduce OmniParser latency (explore local models, caching, etc.) and optimize state management (avoid full re-parse).
+* **Robustness:** Improve LLM planning reliability (prompts, techniques like ReAct), add action verification/error recovery, enhance element targeting.
+* **Target API/Architecture:** Evolve towards a higher-level declarative API (e.g., `@omni.publish` style) and potentially integrate loop logic with the experimental MCP Server (`OmniMCP` class).
+* **Consistency:** Refactor `demo_synthetic.py` to use `AgentExecutor`.
+* **Features:** Expand action space (drag/drop, hover).
+* **Testing:** Add E2E tests, broaden cross-platform validation, define evaluation metrics.
+* **Research:** Explore fine-tuning, process graphs (RAG), framework integration.
-### Future Research Directions
+## Project Status
-Beyond reinforcement learning integration, we plan to explore:
-- Fine-tuning Specialized Models for UI tasks.
-- Process Graph Embeddings with RAG for interaction pattern retrieval.
-- Development of comprehensive evaluation metrics for UI agents.
-- Enhanced cross-platform generalization (testing on Windows, other Linux distros).
-- Integration with broader LLM agent frameworks.
-- Collaborative multi-agent UI automation frameworks.
+Core loop via `cli.py`/`AgentExecutor` is functional for basic tasks. Performance and robustness need significant improvement. MCP integration is experimental.
## Contributing
1. Fork repository
2. Create feature branch
-3. Implement changes
-4. Add tests (and ensure existing ones pass or are appropriately marked)
+3. Implement changes & add tests
+4. Ensure checks pass (`uv run ruff format .`, `uv run ruff check . --fix`, `uv run pytest tests/`)
5. Submit pull request
## License
MIT License
-## Project Status
-
-Core end-to-end loop (Perception -> Planning -> Action) is functional for basic tasks demonstrated in `demo.py`. Performance and robustness require significant improvement. API is experimental.
-
----
-
## Contact
- Issues: [GitHub Issues](https://github.com/OpenAdaptAI/OmniMCP/issues)
diff --git a/cli.py b/cli.py
new file mode 100644
index 0000000..65d9b5c
--- /dev/null
+++ b/cli.py
@@ -0,0 +1,188 @@
+# cli.py
+
+"""
+Command-line interface for running OmniMCP agent tasks using AgentExecutor.
+"""
+
+import platform
+import sys
+import time
+
+import fire
+
+# Import necessary components from the project
+from omnimcp.agent_executor import AgentExecutor
+from omnimcp.config import config
+from omnimcp.core import plan_action_for_ui
+from omnimcp.input import InputController, _pynput_error # Check pynput import status
+from omnimcp.omniparser.client import OmniParserClient
+from omnimcp.omnimcp import VisualState
+from omnimcp.utils import (
+ logger,
+ draw_bounding_boxes,
+ draw_action_highlight,
+ NSScreen, # Check for AppKit on macOS
+)
+
+
+# Default configuration
+DEFAULT_OUTPUT_DIR = "runs"
+DEFAULT_MAX_STEPS = 10
+DEFAULT_GOAL = "Open calculator and compute 5 * 9"
+
+
+def run(
+ goal: str = DEFAULT_GOAL,
+ max_steps: int = DEFAULT_MAX_STEPS,
+ output_dir: str = DEFAULT_OUTPUT_DIR,
+):
+ """
+ Runs the OmniMCP agent to achieve a specified goal.
+
+ Args:
+ goal: The natural language goal for the agent.
+ max_steps: Maximum number of steps to attempt.
+ output_dir: Base directory to save run artifacts (timestamped subdirs).
+ """
+ # --- Initial Checks ---
+ logger.info("--- OmniMCP CLI ---")
+ logger.info("Performing initial checks...")
+ success = True
+
+ # 1. API Key Check
+ if not config.ANTHROPIC_API_KEY:
+ logger.critical(
+ "❌ ANTHROPIC_API_KEY not found in config or .env file. LLM planning requires this."
+ )
+ success = False
+ else:
+ logger.info("✅ ANTHROPIC_API_KEY found.")
+
+ # 2. pynput Check
+ if _pynput_error:
+ logger.critical(
+ f"❌ Input control library (pynput) failed to load: {_pynput_error}"
+ )
+ logger.critical(
+ " Real action execution will not work. Is it installed and prerequisites met (e.g., display server)?"
+ )
+ success = False
+ else:
+ logger.info("✅ Input control library (pynput) loaded.")
+
+ # 3. macOS Scaling Check
+    if platform.system() == "Darwin":  # platform.system() returns "Darwin" on macOS
+ if not NSScreen:
+ logger.warning(
+ "⚠️ AppKit (pyobjc-framework-Cocoa) not found or failed to import."
+ )
+ logger.warning(
+ " Coordinate scaling for Retina displays may be incorrect. Install with 'uv pip install pyobjc-framework-Cocoa'."
+ )
+ else:
+ logger.info("✅ AppKit found for macOS scaling.")
+
+ if not success:
+ logger.error("Prerequisite checks failed. Exiting.")
+ sys.exit(1)
+
+ # --- Component Initialization ---
+ logger.info("\nInitializing components...")
+ try:
+ # OmniParser Client (handles deployment if URL not set)
+ parser_client = OmniParserClient(
+ server_url=config.OMNIPARSER_URL, auto_deploy=(not config.OMNIPARSER_URL)
+ )
+ logger.info(f" - OmniParserClient ready (URL: {parser_client.server_url})")
+
+ # Perception Component
+ visual_state = VisualState(parser_client=parser_client)
+ logger.info(" - VisualState (Perception) ready.")
+
+ # Execution Component
+ controller = InputController()
+ logger.info(" - InputController (Execution) ready.")
+
+ # Planner Function (already imported)
+ logger.info(" - LLM Planner function ready.")
+
+ # Visualization Functions (already imported)
+ logger.info(" - Visualization functions ready.")
+
+ except ImportError as e:
+ logger.critical(
+ f"❌ Component initialization failed due to missing dependency: {e}"
+ )
+ logger.critical(
+ " Ensure all requirements are installed (`uv pip install -e .`)"
+ )
+ sys.exit(1)
+ except Exception as e:
+ logger.critical(f"❌ Component initialization failed: {e}", exc_info=True)
+ sys.exit(1)
+
+ # --- Agent Executor Initialization ---
+ logger.info("\nInitializing Agent Executor...")
+ try:
+ agent_executor = AgentExecutor(
+ perception=visual_state,
+ planner=plan_action_for_ui,
+ execution=controller,
+ box_drawer=draw_bounding_boxes,
+ highlighter=draw_action_highlight,
+ )
+ logger.success("✅ Agent Executor initialized successfully.")
+ except Exception as e:
+ logger.critical(f"❌ Agent Executor initialization failed: {e}", exc_info=True)
+ sys.exit(1)
+
+ # --- User Confirmation & Start ---
+ print("\n" + "=" * 60)
+ print(" WARNING: This script WILL take control of your mouse and keyboard!")
+ print(f" TARGET OS: {platform.system()}")
+ print(" Please ensure no sensitive information is visible on screen.")
+ print(" To stop execution manually: Move mouse RAPIDLY to a screen corner")
+ print(" OR press Ctrl+C in the terminal.")
+ print("=" * 60 + "\n")
+ for i in range(5, 0, -1):
+ print(f"Starting in {i}...", end="\r")
+ time.sleep(1)
+ print("Starting agent run now! ")
+
+ # --- Run the Agent ---
+ overall_success = False
+ try:
+ overall_success = agent_executor.run(
+ goal=goal,
+ max_steps=max_steps,
+ output_base_dir=output_dir,
+ )
+ except KeyboardInterrupt:
+ logger.warning("\nExecution interrupted by user (Ctrl+C).")
+ sys.exit(1)
+ except Exception as run_e:
+ logger.critical(
+ f"\nAn unexpected error occurred during the agent run: {run_e}",
+ exc_info=True,
+ )
+ sys.exit(1)
+ finally:
+ # Optional: Add cleanup here if needed (e.g., stopping parser server)
+ logger.info(
+ "Reminder: If using auto-deploy, stop the parser server with "
+ "'python -m omnimcp.omniparser.server stop' when finished."
+ )
+
+ # --- Exit ---
+ if overall_success:
+ logger.success("\nAgent run finished successfully (goal achieved).")
+ sys.exit(0)
+ else:
+ logger.error(
+ "\nAgent run finished unsuccessfully (goal not achieved or error occurred)."
+ )
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ fire.Fire(run)
diff --git a/demo.py b/demo.py
deleted file mode 100644
index 33395de..0000000
--- a/demo.py
+++ /dev/null
@@ -1,421 +0,0 @@
-# demo.py
-"""
-OmniMCP Demo: Real Perception -> LLM Planner -> Real Action Execution.
-Saves detailed debug images for each step in timestamped directories.
-"""
-
-import platform
-import os
-import time
-import sys
-import datetime # Import datetime
-from typing import List, Optional
-
-from PIL import Image
-import fire
-
-# Import necessary components from the project
-from omnimcp.omniparser.client import OmniParserClient
-from omnimcp.omnimcp import VisualState
-from omnimcp.core import plan_action_for_ui, LLMActionPlan
-from omnimcp.input import InputController
-from omnimcp.utils import (
- logger,
- denormalize_coordinates,
- take_screenshot,
- draw_bounding_boxes, # Import the new drawing function
- get_scaling_factor,
- draw_action_highlight,
-)
-from omnimcp.config import config
-from omnimcp.types import UIElement
-
-
-# --- Configuration ---
-# OUTPUT_DIR is now dynamically created per run
-# SAVE_IMAGES = True # Always save images in this version
-MAX_STEPS = 10
-
-
-def run_real_planner_demo(
- user_goal: str = "Open calculator and compute 5 * 9",
-):
- """
- Runs the main OmniMCP demo loop: Perception -> Planning -> Action.
- Saves detailed debug images to images/{timestamp}/ folder.
-
- Args:
- user_goal: The natural language goal for the agent to achieve.
-
- Returns:
- True if the goal was achieved or max steps were reached without critical error,
- False otherwise.
- """
- run_timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
- run_output_dir = os.path.join("images", run_timestamp)
- os.makedirs(run_output_dir, exist_ok=True)
- logger.info("--- Starting OmniMCP Demo ---")
- logger.info(f"Saving outputs to: {run_output_dir}")
-
- scaling_factor = get_scaling_factor()
- logger.info(f"Using display scaling factor: {scaling_factor}")
-
- # 1. Initialize Client, State Manager, and Controller
- if not config.ANTHROPIC_API_KEY:
- logger.error("ANTHROPIC_API_KEY not found in config. Cannot run planner.")
- return False # Indicate failure
- logger.info("Initializing OmniParserClient, VisualState, and InputController...")
- try:
- parser_client = OmniParserClient(
- server_url=config.OMNIPARSER_URL, auto_deploy=(not config.OMNIPARSER_URL)
- )
- visual_state = VisualState(parser_client=parser_client)
- controller = InputController()
- logger.success(
- f"Client, VisualState, Controller initialized. Parser URL: {parser_client.server_url}"
- )
- except ImportError as e:
- logger.error(
- f"Initialization failed due to missing dependency: {e}. Is pynput or pyobjc installed?"
- )
- return False
- except Exception as e:
- logger.error(f"Initialization failed: {e}", exc_info=True)
- return False
-
- # 2. User Goal
- logger.info(f"User Goal: '{user_goal}'")
-
- action_history: List[str] = []
- goal_achieved = False
- # Tracks if loop broke due to error vs completing/reaching max steps
- final_step_success = True
- last_step_completed = -1 # Track the index of the last fully completed step
-
- # --- Main Loop ---
- for step in range(MAX_STEPS):
- logger.info(f"\n--- Step {step + 1}/{MAX_STEPS} ---")
- step_start_time = time.time()
- # Use 1-based index for user-friendly filenames
- step_img_prefix = f"step_{step + 1}"
-
- # 3. Get CURRENT REAL State (Screenshot -> Parse -> Map)
- logger.info("Getting current screen state...")
- current_image: Optional[Image.Image] = None
- current_elements: List[UIElement] = []
- try:
- visual_state.update() # Synchronous update
- current_elements = visual_state.elements or []
- current_image = visual_state._last_screenshot
-
- if not current_image:
- logger.error("Failed to get screenshot for current state. Stopping.")
- final_step_success = False
- break # Exit loop
-
- logger.info(
- f"Current state captured with {len(current_elements)} elements."
- )
-
- # Save Raw State Image
- raw_state_path = os.path.join(
- run_output_dir, f"{step_img_prefix}_state_raw.png"
- )
- try:
- current_image.save(raw_state_path)
- logger.info(f"Saved raw state to {raw_state_path}")
- except Exception as save_e:
- logger.warning(f"Could not save raw state image: {save_e}")
-
- # Save Parsed State Image (with bounding boxes)
- parsed_state_path = os.path.join(
- run_output_dir, f"{step_img_prefix}_state_parsed.png"
- )
- try:
- # Ensure draw_bounding_boxes is available
- img_with_boxes = draw_bounding_boxes(
- current_image, current_elements, color="lime", show_ids=True
- )
- img_with_boxes.save(parsed_state_path)
- logger.info(f"Saved parsed state visualization to {parsed_state_path}")
- except NameError:
- logger.warning(
- "draw_bounding_boxes function not found, cannot save parsed state image."
- )
- except Exception as draw_e:
- logger.warning(f"Could not save parsed state image: {draw_e}")
-
- except Exception as e:
- logger.error(f"Failed to get visual state: {e}", exc_info=True)
- final_step_success = False
- break # Stop loop if state update fails
-
- # 4. Plan Next Action using LLM Planner
- logger.info("Planning action with LLM...")
- llm_plan: Optional[LLMActionPlan] = None
- target_element: Optional[UIElement] = None
- try:
- llm_plan, target_element = plan_action_for_ui(
- elements=current_elements,
- user_goal=user_goal,
- action_history=action_history,
- step=step, # Pass 0-based step index for conditional logging
- )
- logger.info(f"LLM Reasoning: {llm_plan.reasoning}")
- logger.info(f"LLM Goal Complete Assessment: {llm_plan.is_goal_complete}")
-
- except Exception as plan_e:
- logger.error(f"Error during LLM planning: {plan_e}", exc_info=True)
- final_step_success = False
- break # Stop loop if planning fails
-
- # 5. Check for Goal Completion BEFORE acting
- if llm_plan.is_goal_complete:
- logger.success("LLM determined the goal is achieved!")
- goal_achieved = True
- last_step_completed = step # Mark this step as completed before breaking
- break # Exit loop successfully
-
- # 6. Validate Target Element (Ensure click has a target)
- if llm_plan.action == "click" and target_element is None:
- logger.error(
- f"Action 'click' requires element ID {llm_plan.element_id}, but it was not found in the current state. Stopping."
- )
- final_step_success = False
- break # Stop loop if required element is missing
-
- # 7. Visualize Planned Action (Highlight Target OR Annotate Action)
- if llm_plan and current_image: # Check if we have a plan and image
- highlight_img_path = os.path.join(
- run_output_dir, f"{step_img_prefix}_action_highlight.png"
- )
- try:
- # Call the function - it handles None element internally
- highlighted_image = draw_action_highlight(
- current_image,
- target_element, # Pass element (can be None)
- plan=llm_plan,
- color="red",
- width=3,
- )
- highlighted_image.save(highlight_img_path)
- logger.info(f"Saved action visualization to {highlight_img_path}")
- except Exception as draw_e:
- logger.warning(f"Could not save action visualization image: {draw_e}")
-
- # Record action for history BEFORE execution
- action_desc = f"Step {step + 1}: Planned {llm_plan.action}"
- if target_element:
- action_desc += (
- f" on ID {target_element.id} ('{target_element.content[:30]}...')"
- )
- if llm_plan.text_to_type:
- action_desc += f" Text='{llm_plan.text_to_type[:20]}...'"
- if llm_plan.key_info:
- action_desc += f" Key='{llm_plan.key_info}'"
- action_history.append(action_desc)
- logger.debug(f"Added to history: {action_desc}")
-
- # 8. Execute REAL Action using InputController
- logger.info(f"Executing action: {llm_plan.action}...")
- action_success = False
- try:
- if visual_state.screen_dimensions is None:
- # Should not happen if screenshot was taken, but safety check
- raise RuntimeError("Cannot execute action: screen dimensions unknown.")
- # screen_w/h are physical pixel dimensions from screenshot
- screen_w, screen_h = visual_state.screen_dimensions
-
- if llm_plan.action == "click":
- if target_element: # Validation already done
- # Denormalize to get PHYSICAL PIXEL coordinates for center
- abs_x, abs_y = denormalize_coordinates(
- target_element.bounds[0],
- target_element.bounds[1],
- screen_w,
- screen_h,
- target_element.bounds[2],
- target_element.bounds[3],
- )
- # Convert to LOGICAL points for pynput controller
- logical_x = int(abs_x / scaling_factor)
- logical_y = int(abs_y / scaling_factor)
- logger.info(
- f"Converted physical click ({abs_x},{abs_y}) to logical ({logical_x},{logical_y}) using factor {scaling_factor}"
- )
- action_success = controller.click(
- logical_x, logical_y, click_type="single"
- )
- # No else needed, already validated above
-
- elif llm_plan.action == "type":
- if llm_plan.text_to_type is not None:
- if target_element: # Click if target specified
- # Denormalize to get PHYSICAL PIXEL coordinates for center
- abs_x, abs_y = denormalize_coordinates(
- target_element.bounds[0],
- target_element.bounds[1],
- screen_w,
- screen_h,
- target_element.bounds[2],
- target_element.bounds[3],
- )
- # Convert to LOGICAL points for pynput controller
- logical_x = int(abs_x / scaling_factor)
- logical_y = int(abs_y / scaling_factor)
- logger.info(
- f"Clicking element {target_element.id} at logical ({logical_x},{logical_y}) before typing..."
- )
- clicked_before_type = controller.click(logical_x, logical_y)
- if not clicked_before_type:
- logger.warning(
- "Failed to click target element before typing, attempting to type anyway."
- )
- # Allow time for focus to shift after click
- time.sleep(0.2)
- else:
- # No target element specified (e.g., typing into Spotlight after Cmd+Space)
- logger.info(
- "No target element specified for type action, assuming focus is correct."
- )
-
- # Typing uses its own pynput method
- action_success = controller.type_text(llm_plan.text_to_type)
- else:
- logger.error(
- "Type planned but text_to_type is null."
- ) # Should be caught by Pydantic
-
- elif llm_plan.action == "press_key":
- if llm_plan.key_info:
- action_success = controller.execute_key_string(llm_plan.key_info)
- else:
- logger.error(
- "Press_key planned but key_info is null."
- ) # Should be caught by Pydantic
-
- elif llm_plan.action == "scroll":
- # Basic scroll, direction might be inferred crudely from reasoning
- # Scroll amount units depend on pynput/OS, treat as steps/lines
- scroll_dir = llm_plan.reasoning.lower()
- scroll_amount_steps = 3 # Scroll N steps/lines
- scroll_dy = (
- -scroll_amount_steps
- if "down" in scroll_dir
- else scroll_amount_steps
- if "up" in scroll_dir
- else 0
- )
- scroll_dx = (
- -scroll_amount_steps
- if "left" in scroll_dir
- else scroll_amount_steps
- if "right" in scroll_dir
- else 0
- )
-
- if scroll_dx != 0 or scroll_dy != 0:
- action_success = controller.scroll(scroll_dx, scroll_dy)
- else:
- logger.warning(
- "Scroll planned, but direction unclear or zero amount. Skipping scroll."
- )
- action_success = True # No action needed counts as success here
-
- else:
- # Should not happen if LLM plan validation works
- logger.warning(
- f"Action type '{llm_plan.action}' execution not implemented."
- )
- action_success = False
-
- # Check action result and break loop if failed
- if action_success:
- logger.success("Action executed successfully.")
- else:
- logger.error(
- f"Action '{llm_plan.action}' execution failed or was skipped."
- )
- final_step_success = False
- break # Stop loop if low-level action failed
-
- except Exception as exec_e:
- # Catch unexpected errors during execution block
- logger.error(f"Error during action execution: {exec_e}", exc_info=True)
- final_step_success = False
- break # Stop loop on execution error
-
- # Mark step as completed successfully before proceeding
- last_step_completed = step
-
- # Wait for UI to settle after the action
- time.sleep(1.5) # Adjust as needed
- logger.info(f"Step {step + 1} duration: {time.time() - step_start_time:.2f}s")
-
- # --- End of Loop ---
- logger.info("\n--- Demo Finished ---")
- if goal_achieved:
- logger.success("Overall goal marked as achieved by LLM.")
- # Check if loop completed all steps successfully OR broke early due to goal achieved
- elif final_step_success and (last_step_completed == MAX_STEPS - 1 or goal_achieved):
- if not goal_achieved: # Means max steps reached
- logger.warning(
- f"Reached maximum steps ({MAX_STEPS}) without goal completion."
- )
- # If goal_achieved is True, success message already printed
- else:
- # Loop broke early due to an error
- logger.error(
- f"Execution stopped prematurely after Step {last_step_completed + 1} due to an error."
- )
-
- # Save the VERY final screen state
- logger.info("Capturing final screen state...")
- final_image = take_screenshot()
- if final_image:
- final_state_img_path = os.path.join(run_output_dir, "final_state.png")
- try:
- final_image.save(final_state_img_path)
- logger.info(f"Saved final screen state to {final_state_img_path}")
- except Exception as save_e:
- logger.warning(f"Could not save final state image: {save_e}")
-
- logger.info(f"Debug images saved in: {run_output_dir}")
- logger.info(
- "Reminder: Run 'python -m omnimcp.omniparser.server stop' to shut down the EC2 instance if deployed."
- )
-
- # Return True if goal was achieved, or if max steps were reached without error
- return goal_achieved or (
- final_step_success and last_step_completed == MAX_STEPS - 1
- )
-
-
-if __name__ == "__main__":
- if not config.ANTHROPIC_API_KEY:
- print("ERROR: ANTHROPIC_API_KEY missing.")
- sys.exit(1)
-
- print("\n" + "=" * 60)
- print(" WARNING: This script WILL take control of your mouse and keyboard!")
- print(f" TARGET OS: {platform.system()}")
- print(" Please ensure no sensitive information is visible on screen.")
- print(" To stop execution manually: Move mouse RAPIDLY to a screen corner")
- print(" OR press Ctrl+C in the terminal.")
- print("=" * 60 + "\n")
- for i in range(5, 0, -1):
- print(f"Starting in {i}...", end="\r")
- time.sleep(1)
- print("Starting now! ")
-
- try:
- # Use fire to handle CLI arguments for run_real_planner_demo
- fire.Fire(run_real_planner_demo)
- # Assume success if fire completes without raising an exception here
- sys.exit(0)
- except KeyboardInterrupt:
- logger.warning("Execution interrupted by user (Ctrl+C).")
- sys.exit(1)
- except Exception:
- logger.exception("An unexpected error occurred during the demo execution.")
- sys.exit(1)
diff --git a/demo_synthetic.py b/demo_synthetic.py
index 907e253..5d329e1 100644
--- a/demo_synthetic.py
+++ b/demo_synthetic.py
@@ -1,8 +1,12 @@
# demo_synthetic.py
+"""
+OmniMCP Demo: Synthetic Perception -> LLM Planner -> Synthetic Action Validation.
+Generates UI images and simulates the loop without real screen interaction.
+"""
import os
import time
-from typing import List, Optional # Import Any for plan typing
+from typing import List, Optional
# Import necessary components from the project
from omnimcp.synthetic_ui import (
@@ -10,17 +14,30 @@
simulate_action,
draw_highlight, # Use the original draw_highlight from synthetic_ui
)
-from omnimcp.core import plan_action_for_ui, LLMActionPlan # Import the Pydantic model
-from omnimcp.utils import logger # Assuming logger is configured elsewhere
-from omnimcp.types import UIElement # Import UIElement
+from omnimcp.core import plan_action_for_ui, LLMActionPlan
+from omnimcp.utils import logger
+from omnimcp.types import UIElement
+
+# NOTE ON REFACTORING:
+# The main loop structure in this script (run_synthetic_planner_demo) is similar
+# to the core logic now encapsulated in `omnimcp.agent_executor.AgentExecutor`.
+# In the future, this synthetic demo could be refactored to:
+# 1. Create synthetic implementations of the PerceptionInterface and ExecutionInterface.
+# 2. Instantiate AgentExecutor with these synthetic components.
+# 3. Call `agent_executor.run(...)`.
+# This would further consolidate the core loop logic and allow testing the
+# AgentExecutor orchestration with controlled, synthetic inputs/outputs.
+# For now, this script remains separate to demonstrate the synthetic setup
+# independently.
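+#
+# A rough, untested sketch of that refactor (the Synthetic* class names are
+# hypothetical; the attribute/method shapes come from the Protocols in
+# omnimcp/agent_executor.py):
+#
+#     class SyntheticPerception:  # satisfies PerceptionInterface
+#         def __init__(self, image, elements):
+#             self.elements = elements
+#             self._last_screenshot = image
+#             self.screen_dimensions = image.size
+#
+#         def update(self) -> None:
+#             pass  # state is advanced by the synthetic executor instead
+#
+#     class SyntheticExecution:  # satisfies ExecutionInterface
+#         def click(self, x: int, y: int, click_type: str = "single") -> bool:
+#             return True  # mutate the synthetic UI state here
+#
+#         def type_text(self, text: str) -> bool:
+#             return True
+#
+#         def execute_key_string(self, key_info_str: str) -> bool:
+#             return True
+#
+#         def scroll(self, dx: int, dy: int) -> bool:
+#             return True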
+
# --- Configuration ---
-OUTPUT_DIR = "demo_output_multistep" # Keep original output dir for synthetic demo
+OUTPUT_DIR = "demo_output_multistep"
SAVE_IMAGES = True
-MAX_STEPS = 6 # Keep original max steps for this demo
+MAX_STEPS = 6
-def run_multi_step_demo():
+def run_synthetic_planner_demo():
"""Runs the multi-step OmniMCP demo using synthetic UI and LLM planning."""
logger.info("--- Starting OmniMCP Multi-Step Synthetic Demo ---")
os.makedirs(OUTPUT_DIR, exist_ok=True)
@@ -40,13 +57,13 @@ def run_multi_step_demo():
logger.info(f"User Goal: '{user_goal}'")
action_history: List[str] = []
- goal_achieved_flag = False # Use a flag to signal completion after the step runs
- last_step_completed = -1 # Track last successful step index
+ goal_achieved_flag = False
+ last_step_completed = -1
# --- Main Loop ---
for step in range(MAX_STEPS):
logger.info(f"\n--- Step {step + 1}/{MAX_STEPS} ---")
- step_img_prefix = f"step_{step + 1}" # Use 1-based index for filenames
+ step_img_prefix = f"step_{step + 1}"
# Save/Show current state *before* planning/highlighting
current_state_img_path = os.path.join(
@@ -65,10 +82,10 @@ def run_multi_step_demo():
target_element: Optional[UIElement] = None
try:
llm_plan, target_element = plan_action_for_ui(
- elements=elements, # Pass current elements
+ elements=elements,
user_goal=user_goal,
action_history=action_history,
- step=step, # Pass step index
+ step=step,
)
logger.info(f"LLM Reasoning: {llm_plan.reasoning}")
@@ -81,43 +98,31 @@ def run_multi_step_demo():
logger.info(f"Key Info: '{llm_plan.key_info}'")
logger.info(f"LLM Goal Complete Assessment: {llm_plan.is_goal_complete}")
- # 3. Check for Goal Completion Flag (but don't break loop yet)
+ # 3. Check for Goal Completion Flag
if llm_plan.is_goal_complete:
logger.info(
"LLM flag indicates goal should be complete after this action."
)
- goal_achieved_flag = (
- True # Set flag to break after this step's simulation
- )
+ goal_achieved_flag = True
# --- Updated Validation Check ---
- # Validate target element ONLY IF the goal is NOT yet complete AND action requires it
if not goal_achieved_flag:
- # Click requires a valid target element found in the current state
if llm_plan.action == "click" and not target_element:
logger.error(
f"LLM planned 'click' on invalid element ID ({llm_plan.element_id}). Stopping."
)
- break # Stop if click is impossible
-
- # Type MIGHT require a target in synthetic demo, depending on simulate_action logic
- # If simulate_action assumes type always targets a field, uncomment below
- # if llm_plan.action == "type" and not target_element:
- # logger.error(f"LLM planned 'type' on invalid element ID ({llm_plan.element_id}). Stopping.")
- # break
- # --- End Updated Validation Check ---
+ break
# 4. Visualize Planned Action (uses synthetic_ui.draw_highlight)
highlight_img_path = os.path.join(
OUTPUT_DIR, f"{step_img_prefix}_highlight.png"
)
- if target_element: # Only draw highlight if element exists
+ if target_element:
try:
- # Pass the llm_plan to the draw_highlight function
highlighted_image = draw_highlight(
image,
target_element,
- plan=llm_plan, # Pass the plan object here
+ plan=llm_plan,
color="lime",
width=4,
)
@@ -129,14 +134,27 @@ def run_multi_step_demo():
except Exception as draw_e:
logger.warning(f"Could not save highlight image: {draw_e}")
else:
- logger.info("No target element to highlight for this step.")
+ # For non-element actions like press_key, still save an image showing the state
+ # before the action, potentially adding text annotation later if needed.
+ if SAVE_IMAGES:
+ try:
+ image.save(
+ highlight_img_path.replace(
+ "_highlight.png", "_state_before_no_highlight.png"
+ )
+ )
+ logger.info("No target element, saved pre-action state.")
+ except Exception as save_e:
+ logger.warning(
+ f"Could not save pre-action state image: {save_e}"
+ )
# Record action for history *before* simulation changes state
action_desc = f"Action: {llm_plan.action}"
if llm_plan.text_to_type:
action_desc += f" '{llm_plan.text_to_type}'"
if llm_plan.key_info:
- action_desc += f" Key='{llm_plan.key_info}'" # Add key_info if present
+ action_desc += f" Key='{llm_plan.key_info}'"
if target_element:
action_desc += (
f" on Element ID {target_element.id} ('{target_element.content}')"
@@ -144,9 +162,8 @@ def run_multi_step_demo():
action_history.append(action_desc)
logger.debug(f"Added to history: {action_desc}")
- # 5. Simulate Action -> Get New State (ALWAYS run this for the planned step)
+ # 5. Simulate Action -> Get New State
logger.info("Simulating action...")
- # Extract username now in case login is successful in this step
username = next(
(
el.content
@@ -156,12 +173,10 @@ def run_multi_step_demo():
"User",
)
- # simulate_action needs to handle the LLMActionPlan type
new_image, new_elements = simulate_action(
image, elements, llm_plan, username_for_login=username
)
- # Basic check if state actually changed
state_changed = (
(id(new_image) != id(image))
or (len(elements) != len(new_elements))
@@ -171,7 +186,7 @@ def run_multi_step_demo():
)
)
- image, elements = new_image, new_elements # Update state for next loop
+ image, elements = new_image, new_elements
if state_changed:
logger.info(
@@ -182,7 +197,6 @@ def run_multi_step_demo():
"Simulation did not result in a detectable state change."
)
- # Mark step as completed successfully before checking goal flag or pausing
last_step_completed = step
# 6. NOW check the flag to break *after* simulation
@@ -192,24 +206,21 @@ def run_multi_step_demo():
)
break
- # Pause briefly between steps
time.sleep(1)
except Exception as e:
logger.error(f"Error during step {step + 1}: {e}", exc_info=True)
- break # Stop on error
+ break
# --- End of Loop ---
logger.info("\n--- Multi-Step Synthetic Demo Finished ---")
if goal_achieved_flag:
logger.success("Overall goal marked as achieved by LLM during execution.")
elif last_step_completed == MAX_STEPS - 1:
- # Reached end without goal flag, but no error broke the loop
logger.warning(
f"Reached maximum steps ({MAX_STEPS}) without goal completion flag being set."
)
else:
- # Loop broke early due to error or other condition
logger.error(
f"Execution stopped prematurely after Step {last_step_completed + 1} (check logs)."
)
@@ -225,9 +236,10 @@ def run_multi_step_demo():
if __name__ == "__main__":
- # Add basic check for API key if running this directly
- # (Although synthetic demo doesn't *strictly* need it if core allows planning without it)
- # from omnimcp.config import config # Example if config is needed
+ # Optional: Add check for API key, though planning might work differently
+ # depending on whether core.plan_action_for_ui *requires* the LLM call
+ # or could potentially use non-LLM logic someday.
+ # from omnimcp.config import config
# if not config.ANTHROPIC_API_KEY:
- # print("Warning: ANTHROPIC_API_KEY not found. LLM planning might fail.")
- run_multi_step_demo()
+ # logger.warning("ANTHROPIC_API_KEY not found. LLM planning might fail.")
+ run_synthetic_planner_demo()
diff --git a/omnimcp/agent_executor.py b/omnimcp/agent_executor.py
new file mode 100644
index 0000000..79d1a8b
--- /dev/null
+++ b/omnimcp/agent_executor.py
@@ -0,0 +1,429 @@
+# omnimcp/agent_executor.py
+
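+"""Orchestrates the perceive-plan-act loop used by cli.py.
+
+Perception, planning, and execution are injected via the Protocol and
+Callable types defined below, so real and synthetic backends are
+interchangeable.
+"""
+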
+import datetime
+import os
+import time
+from typing import Callable, List, Optional, Tuple, Protocol, Dict
+
+from PIL import Image
+
+
+# Core data types used in the Protocol and Callable definitions below:
+from .types import LLMActionPlan, UIElement
+from .utils import (
+ denormalize_coordinates,
+ draw_action_highlight,
+ draw_bounding_boxes,
+ get_scaling_factor,
+ logger,
+ take_screenshot,
+)
+
+
+class PerceptionInterface(Protocol):
+ elements: List[UIElement]
+ screen_dimensions: Optional[Tuple[int, int]]
+ _last_screenshot: Optional[Image.Image]
+
+ def update(self) -> None: ...
+
+
+class ExecutionInterface(Protocol):
+ def click(self, x: int, y: int, click_type: str = "single") -> bool: ...
+ def type_text(self, text: str) -> bool: ...
+ def execute_key_string(self, key_info_str: str) -> bool: ...
+ def scroll(self, dx: int, dy: int) -> bool: ...
+
+
+PlannerCallable = Callable[
+ [List[UIElement], str, List[str], int, str],
+ Tuple[LLMActionPlan, Optional[UIElement]],
+]
+ImageProcessorCallable = Callable[..., Image.Image]
+
+
+# --- Core Agent Executor ---
+
+
+class AgentExecutor:
+ """
+ Orchestrates the perceive-plan-act loop for UI automation tasks.
+    Dispatches each planned action to a dedicated handler method.
+ """
+
+ def __init__(
+ self,
+ perception: PerceptionInterface,
+ planner: PlannerCallable,
+ execution: ExecutionInterface,
+ box_drawer: Optional[ImageProcessorCallable] = draw_bounding_boxes,
+ highlighter: Optional[ImageProcessorCallable] = draw_action_highlight,
+ ):
+ self._perception = perception
+ self._planner = planner
+ self._execution = execution
+ self._box_drawer = box_drawer
+ self._highlighter = highlighter
+ self.action_history: List[str] = []
+
+ # Map action names to their handler methods
+ self._action_handlers: Dict[str, Callable[..., bool]] = {
+ "click": self._execute_click,
+ "type": self._execute_type,
+ "press_key": self._execute_press_key,
+ "scroll": self._execute_scroll,
+ }
+ logger.info("AgentExecutor initialized with action handlers.")
+
+ # --- Private Action Handlers ---
+
+ def _execute_click(
+ self,
+ plan: LLMActionPlan,
+ target_element: Optional[UIElement],
+ screen_dims: Tuple[int, int],
+ scaling_factor: int,
+ ) -> bool:
+ """Handles the 'click' action."""
+ if not target_element:
+ logger.error(
+ f"Click action requires target element ID {plan.element_id}, but it's missing."
+ )
+ return False # Should have been caught earlier, but safety check
+
+ screen_w, screen_h = screen_dims
+ # Denormalize to get PHYSICAL PIXEL coordinates for center
+ abs_x, abs_y = denormalize_coordinates(
+ target_element.bounds[0],
+ target_element.bounds[1],
+ screen_w,
+ screen_h,
+ target_element.bounds[2],
+ target_element.bounds[3],
+ )
+ # Convert to LOGICAL points for execution component
+ logical_x = int(abs_x / scaling_factor)
+ logical_y = int(abs_y / scaling_factor)
+ logger.debug(f"Executing click at logical coords: ({logical_x}, {logical_y})")
+ return self._execution.click(logical_x, logical_y, click_type="single")
+
+ def _execute_type(
+ self,
+ plan: LLMActionPlan,
+ target_element: Optional[UIElement],
+ screen_dims: Tuple[int, int],
+ scaling_factor: int,
+ ) -> bool:
+ """Handles the 'type' action."""
+ if plan.text_to_type is None:
+ logger.error("Action 'type' planned but text_to_type is null.")
+ return False # Should be caught by Pydantic validation
+
+ if target_element: # Click target element first if specified
+ screen_w, screen_h = screen_dims
+ abs_x, abs_y = denormalize_coordinates(
+ target_element.bounds[0],
+ target_element.bounds[1],
+ screen_w,
+ screen_h,
+ target_element.bounds[2],
+ target_element.bounds[3],
+ )
+ logical_x = int(abs_x / scaling_factor)
+ logical_y = int(abs_y / scaling_factor)
+ logger.debug(
+ f"Clicking target element {target_element.id} at logical ({logical_x},{logical_y}) before typing..."
+ )
+ if not self._execution.click(logical_x, logical_y):
+ logger.warning(
+ "Failed to click target before typing, attempting type anyway."
+ )
+ time.sleep(0.2) # Pause after click
+
+ logger.debug(f"Executing type: '{plan.text_to_type[:50]}...'")
+ return self._execution.type_text(plan.text_to_type)
+
+ def _execute_press_key(
+ self,
+ plan: LLMActionPlan,
+ target_element: Optional[UIElement], # Unused, but maintains handler signature
+ screen_dims: Tuple[int, int], # Unused
+ scaling_factor: int, # Unused
+ ) -> bool:
+ """Handles the 'press_key' action."""
+ if not plan.key_info:
+ logger.error("Action 'press_key' planned but key_info is null.")
+ return False # Should be caught by Pydantic validation
+ logger.debug(f"Executing press_key: '{plan.key_info}'")
+ return self._execution.execute_key_string(plan.key_info)
+
+ def _execute_scroll(
+ self,
+ plan: LLMActionPlan,
+ target_element: Optional[UIElement], # Unused
+ screen_dims: Tuple[int, int], # Unused
+ scaling_factor: int, # Unused
+ ) -> bool:
+ """Handles the 'scroll' action."""
+ # Basic scroll logic based on reasoning hint
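+        # e.g. reasoning "Scrolling down to reveal more results" yields
+        # dy = -3 (scroll down); if no direction is found, no scroll occurs.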
+ scroll_dir = plan.reasoning.lower()
+ scroll_amount_steps = 3
+ scroll_dy = (
+ -scroll_amount_steps
+ if "down" in scroll_dir
+ else scroll_amount_steps
+ if "up" in scroll_dir
+ else 0
+ )
+ scroll_dx = (
+ -scroll_amount_steps
+ if "left" in scroll_dir
+ else scroll_amount_steps
+ if "right" in scroll_dir
+ else 0
+ )
+
+ if scroll_dx != 0 or scroll_dy != 0:
+ logger.debug(f"Executing scroll: dx={scroll_dx}, dy={scroll_dy}")
+ return self._execution.scroll(scroll_dx, scroll_dy)
+ else:
+ logger.warning(
+ "Scroll planned but direction/amount unclear, skipping scroll."
+ )
+ return True # No action needed counts as success
+
+ # Comparison Note:
+ # This `run` method implements an explicit, sequential perceive-plan-act loop.
+ # Alternative agent architectures exist... (rest of comment remains same)
+
+ def run(
+ self, goal: str, max_steps: int = 10, output_base_dir: str = "runs"
+ ) -> bool:
+ """
+ Runs the main perceive-plan-act loop to achieve the goal.
+
+ Args:
+ goal: The natural language goal for the agent.
+ max_steps: Maximum number of steps to attempt.
+ output_base_dir: Base directory to save run artifacts (timestamped).
+
+ Returns:
+ True if the goal was achieved, False otherwise (error or max steps reached).
+ """
+ run_timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+ run_output_dir = os.path.join(output_base_dir, run_timestamp)
+ try:
+ os.makedirs(run_output_dir, exist_ok=True)
+ logger.info(f"Starting agent run. Goal: '{goal}'")
+ logger.info(f"Saving outputs to: {run_output_dir}")
+ except OSError as e:
+ logger.error(f"Failed to create output directory {run_output_dir}: {e}")
+ return False
+
+ self.action_history = []
+ goal_achieved = False
+ final_step_success = True
+ last_step_completed = -1
+
+ try:
+ scaling_factor = get_scaling_factor()
+ logger.info(f"Using display scaling factor: {scaling_factor}")
+ except Exception as e:
+ logger.error(f"Failed to get scaling factor: {e}. Assuming 1.")
+ scaling_factor = 1
+
+ # --- Main Loop ---
+ for step in range(max_steps):
+ logger.info(f"\n--- Step {step + 1}/{max_steps} ---")
+ step_start_time = time.time()
+ step_img_prefix = f"step_{step + 1}"
+ current_image: Optional[Image.Image] = None
+ current_elements: List[UIElement] = []
+ screen_dimensions: Optional[Tuple[int, int]] = None
+
+ # 1. Perceive State
+ try:
+ logger.debug("Perceiving current screen state...")
+ self._perception.update()
+ current_elements = self._perception.elements or []
+ current_image = self._perception._last_screenshot
+ screen_dimensions = self._perception.screen_dimensions
+
+ if not current_image or not screen_dimensions:
+ raise RuntimeError("Failed to get valid screenshot or dimensions.")
+ logger.info(f"Perceived state with {len(current_elements)} elements.")
+
+ except Exception as perceive_e:
+ logger.error(f"Perception failed: {perceive_e}", exc_info=True)
+ final_step_success = False
+ break
+
+ # 2. Save State Artifacts (Unchanged)
+ raw_state_path = os.path.join(
+ run_output_dir, f"{step_img_prefix}_state_raw.png"
+ )
+ try:
+ current_image.save(raw_state_path)
+ logger.debug(f"Saved raw state image to {raw_state_path}")
+ except Exception as save_raw_e:
+ logger.warning(f"Could not save raw state image: {save_raw_e}")
+
+ if self._box_drawer:
+ parsed_state_path = os.path.join(
+ run_output_dir, f"{step_img_prefix}_state_parsed.png"
+ )
+ try:
+ img_with_boxes = self._box_drawer(
+ current_image, current_elements, color="lime", show_ids=True
+ )
+ img_with_boxes.save(parsed_state_path)
+ logger.debug(
+ f"Saved parsed state visualization to {parsed_state_path}"
+ )
+ except Exception as draw_boxes_e:
+ logger.warning(f"Could not save parsed state image: {draw_boxes_e}")
+
+ # 3. Plan Action (Unchanged)
+ llm_plan: Optional[LLMActionPlan] = None
+ target_element: Optional[UIElement] = None
+ try:
+ logger.debug("Planning next action...")
+ llm_plan, target_element = self._planner(
+ elements=current_elements,
+ user_goal=goal,
+ action_history=self.action_history,
+ step=step, # 0-based index
+ )
+ # (Logging of plan details remains here)
+ logger.info(f"LLM Reasoning: {llm_plan.reasoning}")
+ logger.info(
+ f"LLM Plan: Action={llm_plan.action}, TargetID={llm_plan.element_id}, GoalComplete={llm_plan.is_goal_complete}"
+ )
+ if llm_plan.text_to_type:
+ logger.info(f"LLM Plan: Text='{llm_plan.text_to_type[:50]}...'")
+ if llm_plan.key_info:
+ logger.info(f"LLM Plan: KeyInfo='{llm_plan.key_info}'")
+
+ except Exception as plan_e:
+ logger.error(f"Planning failed: {plan_e}", exc_info=True)
+ final_step_success = False
+ break
+
+ # 4. Check Goal Completion (Before Action) (Unchanged)
+ if llm_plan.is_goal_complete:
+ logger.success("LLM determined the goal is achieved!")
+ goal_achieved = True
+ last_step_completed = step
+ break
+
+ # 5. Validate Action Requirements (Unchanged)
+ if llm_plan.action == "click" and target_element is None:
+ logger.error(
+ f"Action 'click' planned for element ID {llm_plan.element_id}, but element not found. Stopping."
+ )
+ final_step_success = False
+ break
+
+ # 6. Visualize Planned Action (Unchanged)
+ if self._highlighter and current_image:
+ highlight_img_path = os.path.join(
+ run_output_dir, f"{step_img_prefix}_action_highlight.png"
+ )
+ try:
+ highlighted_image = self._highlighter(
+ current_image,
+ element=target_element,
+ plan=llm_plan,
+ color="red",
+ width=3,
+ )
+ highlighted_image.save(highlight_img_path)
+ logger.debug(f"Saved action visualization to {highlight_img_path}")
+ except Exception as draw_highlight_e:
+ logger.warning(
+ f"Could not save action visualization image: {draw_highlight_e}"
+ )
+
+ # 7. Update Action History (Before Execution) (Unchanged)
+ action_desc = f"Step {step + 1}: Planned {llm_plan.action}"
+ if target_element:
+ action_desc += (
+ f" on ID {target_element.id} ('{target_element.content[:30]}...')"
+ )
+ if llm_plan.text_to_type:
+ action_desc += f" Text='{llm_plan.text_to_type[:20]}...'"
+ if llm_plan.key_info:
+ action_desc += f" Key='{llm_plan.key_info}'"
+ self.action_history.append(action_desc)
+ logger.debug(f"Added to history: {action_desc}")
+
+ # 8. Execute Action (Refactored)
+ logger.info(f"Executing action: {llm_plan.action}...")
+ action_success = False
+ try:
+ handler = self._action_handlers.get(llm_plan.action)
+ if handler:
+ # Pass necessary arguments to the handler
+ action_success = handler(
+ plan=llm_plan,
+ target_element=target_element,
+ screen_dims=screen_dimensions,
+ scaling_factor=scaling_factor,
+ )
+ else:
+ logger.error(
+ f"Execution handler for action type '{llm_plan.action}' not found."
+ )
+ action_success = False
+
+ # Check execution result
+ if not action_success:
+ logger.error(f"Action '{llm_plan.action}' execution failed.")
+ final_step_success = False
+ break
+ else:
+ logger.success("Action executed successfully.")
+
+ except Exception as exec_e:
+ logger.error(
+ f"Exception during action execution: {exec_e}", exc_info=True
+ )
+ final_step_success = False
+ break
+
+ # Mark step as fully completed (Unchanged)
+ last_step_completed = step
+
+ # Wait for UI to settle (Unchanged)
+ time.sleep(1.5)
+ logger.debug(
+ f"Step {step + 1} duration: {time.time() - step_start_time:.2f}s"
+ )
+
+ # --- End of Loop --- (Rest of the method remains the same)
+ logger.info("\n--- Agent Run Finished ---")
+ if goal_achieved:
+ logger.success("Overall goal marked as achieved by LLM.")
+ elif final_step_success and last_step_completed == max_steps - 1:
+ logger.warning(
+ f"Reached maximum steps ({max_steps}) without goal completion."
+ )
+ elif not final_step_success:
+ logger.error(
+ f"Execution stopped prematurely after Step {last_step_completed + 1} due to an error."
+ )
+
+ logger.info("Capturing final screen state...")
+ final_state_img_path = os.path.join(run_output_dir, "final_state.png")
+ try:
+ final_image = take_screenshot()
+ if final_image:
+ final_image.save(final_state_img_path)
+ logger.info(f"Saved final screen state to {final_state_img_path}")
+ else:
+ logger.warning("Could not capture final screenshot.")
+ except Exception as save_final_e:
+ logger.warning(f"Could not save final state image: {save_final_e}")
+
+ logger.info(f"Run artifacts saved in: {run_output_dir}")
+ return goal_achieved
diff --git a/pyproject.toml b/pyproject.toml
index a756430..36c46d8 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -27,15 +27,15 @@ dependencies = [
"requests>=2.31.0", # HTTP requests for OmniParser
"mss>=6.1.0", # Screen capture
"jinja2>=3.0.0", # For templating
- "botocore>=1.37.13", # Keep if needed for OmniParser deployment utils
- "boto3>=1.37.13", # Keep if needed for OmniParser deployment utils
- "paramiko>=3.5.1", # Keep if needed for OmniParser deployment utils
+ "botocore>=1.37.13",
+ "boto3>=1.37.13",
+ "paramiko>=3.5.1",
"pydantic-settings>=2.8.1",
"numpy>=2.2.4",
- # pydantic pulled in by pydantic-settings, but explicit is ok
- "pydantic>=2.10.6",
+ "pydantic>=2.10.6", # pydantic pulled in by pydantic-settings, but explicit is ok
"tenacity>=9.0.0",
- # Removed pytest and pytest-mock from main dependencies
+ # Add platform-specific dependency for macOS
+ "pyobjc-framework-Cocoa; sys_platform == 'darwin'",
]
[project.scripts]
diff --git a/tests/test_agent_executor.py b/tests/test_agent_executor.py
new file mode 100644
index 0000000..3e6e1b1
--- /dev/null
+++ b/tests/test_agent_executor.py
@@ -0,0 +1,428 @@
+# tests/test_agent_executor.py
+
+import os
+from typing import List, Optional, Tuple
+from unittest.mock import MagicMock
+
+import pytest
+from PIL import Image
+
+from omnimcp.agent_executor import (
+ AgentExecutor,
+ PerceptionInterface,
+ ExecutionInterface,
+ PlannerCallable,
+)
+from omnimcp import agent_executor
+from omnimcp.types import LLMActionPlan, UIElement
+
+
+class MockPerception(PerceptionInterface):
+ def __init__(
+ self,
+ elements: List[UIElement],
+ dims: Optional[Tuple[int, int]],
+ image: Optional[Image.Image],
+ ):
+ self.elements = elements
+ self.screen_dimensions = dims
+ self._last_screenshot = image
+ self.update_call_count = 0
+ self.fail_on_update = False # Flag to simulate failure
+
+ def update(self) -> None:
+ if (
+ self.fail_on_update and self.update_call_count > 0
+ ): # Fail on second+ call if requested
+ raise ConnectionError("Mock perception failure")
+ self.update_call_count += 1
+ # Simulate state update if needed, or keep static for simple tests
+
+
+class MockExecution(ExecutionInterface):
+ def __init__(self):
+ self.calls = []
+ self.fail_on_action: Optional[str] = None # e.g., "click" to make click fail
+
+ def click(self, x: int, y: int, click_type: str = "single") -> bool:
+ self.calls.append(("click", x, y, click_type))
+ return not (self.fail_on_action == "click")
+
+ def type_text(self, text: str) -> bool:
+ self.calls.append(("type_text", text))
+ return not (self.fail_on_action == "type")
+
+ def execute_key_string(self, key_info_str: str) -> bool:
+ self.calls.append(("execute_key_string", key_info_str))
+ return not (self.fail_on_action == "press_key")
+
+ def scroll(self, dx: int, dy: int) -> bool:
+ self.calls.append(("scroll", dx, dy))
+ return not (self.fail_on_action == "scroll")
+
+
+# --- Pytest Fixtures ---
+
+
+@pytest.fixture
+def mock_image() -> Image.Image:
+ return Image.new("RGB", (200, 100), color="gray") # Slightly larger default
+
+
+@pytest.fixture
+def mock_element() -> UIElement:
+ return UIElement(id=0, type="button", content="OK", bounds=(0.1, 0.1, 0.2, 0.1))
+
+
+@pytest.fixture
+def mock_perception_component(mock_element, mock_image) -> MockPerception:
+ return MockPerception([mock_element], (200, 100), mock_image)
+
+
+@pytest.fixture
+def mock_execution_component() -> MockExecution:
+ return MockExecution()
+
+
+@pytest.fixture
+def mock_box_drawer() -> MagicMock:
+ return MagicMock(return_value=Image.new("RGB", (1, 1))) # Return dummy image
+
+
+@pytest.fixture
+def mock_highlighter() -> MagicMock:
+ return MagicMock(return_value=Image.new("RGB", (1, 1))) # Return dummy image
+
+
+@pytest.fixture
+def temp_output_dir(tmp_path) -> str:
+ """Create a temporary directory for test run outputs."""
+ # tmp_path is a pytest fixture providing a Path object to a unique temp dir
+ output_dir = tmp_path / "test_runs"
+ output_dir.mkdir()
+ return str(output_dir)
+
+
+# --- Mock Planners ---
+
+
+def planner_completes_on_step(n: int) -> PlannerCallable:
+ """Factory for a planner that completes on step index `n`."""
+
+ def mock_planner(
+ elements: List[UIElement], user_goal: str, action_history: List[str], step: int
+ ) -> Tuple[LLMActionPlan, Optional[UIElement]]:
+ target_element = elements[0] if elements else None
+ is_complete = step == n
+ action = "click" if not is_complete else "press_key" # Vary action
+ element_id = target_element.id if target_element and action == "click" else None
+ key_info = "Enter" if is_complete else None
+
+ plan = LLMActionPlan(
+ reasoning=f"Mock reasoning step {step + 1} for goal '{user_goal}'",
+ action=action,
+ element_id=element_id,
+ key_info=key_info,
+ is_goal_complete=is_complete,
+ )
+ return plan, target_element
+
+ return mock_planner
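+
+
+# Note: planner_completes_on_step(1) plans a click on step index 0, then
+# signals is_goal_complete (with action "press_key"/"Enter") on step index 1.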
+
+
+def planner_never_completes() -> PlannerCallable:
+ """Planner that never signals goal completion."""
+
+ def mock_planner(
+ elements: List[UIElement], user_goal: str, action_history: List[str], step: int
+ ) -> Tuple[LLMActionPlan, Optional[UIElement]]:
+ target_element = elements[0] if elements else None
+ element_id = target_element.id if target_element else None
+ plan = LLMActionPlan(
+ reasoning=f"Mock reasoning step {step + 1} for goal '{user_goal}', goal not complete",
+ action="click",
+ element_id=element_id,
+ text_to_type=None,
+ key_info=None,
+ is_goal_complete=False,
+ )
+ return plan, target_element
+
+ return mock_planner
+
+
+def planner_fails() -> PlannerCallable:
+ """Planner that raises an exception."""
+
+ def failing_planner(*args, **kwargs):
+ raise ValueError("Mock planning failure")
+
+ return failing_planner
+
+
+# --- Test Functions ---
+
+
+def test_run_completes_goal(
+ mock_perception_component: MockPerception,
+ mock_execution_component: MockExecution,
+ mock_box_drawer: MagicMock,
+ mock_highlighter: MagicMock,
+ temp_output_dir: str,
+ mocker, # Add mocker fixture
+):
+ """Test a successful run where the goal is completed on the second step."""
+ # --- Add Mock for take_screenshot to avoid $DISPLAY error in CI ---
+ mock_final_image = Image.new("RGB", (50, 50), color="green") # Dummy image
+ mocker.patch.object(
+ agent_executor, "take_screenshot", return_value=mock_final_image
+ )
+ # --- End Mock ---
+
+ complete_step_index = 1
+ executor = AgentExecutor(
+ perception=mock_perception_component,
+ planner=planner_completes_on_step(complete_step_index),
+ execution=mock_execution_component,
+ box_drawer=mock_box_drawer,
+ highlighter=mock_highlighter,
+ )
+
+ result = executor.run(
+ goal="Test Goal", max_steps=5, output_base_dir=temp_output_dir
+ )
+
+ assert result is True, "Should return True when goal is completed."
+ assert (
+ mock_perception_component.update_call_count == complete_step_index + 1
+ ) # Called for steps 0, 1
+ assert (
+ len(mock_execution_component.calls) == complete_step_index
+ ) # Executed only for step 0
+ assert mock_execution_component.calls[0][0] == "click" # Action in step 0
+ assert len(executor.action_history) == complete_step_index
+
+ run_dirs = os.listdir(temp_output_dir)
+ assert len(run_dirs) == 1
+ run_dir_path = os.path.join(temp_output_dir, run_dirs[0])
+ assert os.path.exists(os.path.join(run_dir_path, "step_1_state_raw.png"))
+ assert os.path.exists(os.path.join(run_dir_path, "final_state.png"))
+ assert mock_box_drawer.call_count == complete_step_index + 1
+ assert mock_highlighter.call_count == complete_step_index
+
+
+def test_run_reaches_max_steps(
+ mock_perception_component: MockPerception,
+ mock_execution_component: MockExecution,
+ mock_box_drawer: MagicMock,
+ mock_highlighter: MagicMock,
+ temp_output_dir: str,
+ mocker, # Add mocker fixture for consistency, patch take_screenshot here too
+):
+ """Test reaching max_steps without completing the goal."""
+ # --- Add Mock for take_screenshot to avoid $DISPLAY error in CI ---
+ mock_final_image = Image.new("RGB", (50, 50), color="blue") # Dummy image
+ mocker.patch.object(
+ agent_executor, "take_screenshot", return_value=mock_final_image
+ )
+ # --- End Mock ---
+
+ max_steps = 3
+ executor = AgentExecutor(
+ perception=mock_perception_component,
+ planner=planner_never_completes(),
+ execution=mock_execution_component,
+ box_drawer=mock_box_drawer,
+ highlighter=mock_highlighter,
+ )
+
+ result = executor.run(
+ goal="Test Max Steps", max_steps=max_steps, output_base_dir=temp_output_dir
+ )
+
+ assert result is False, "Should return False when max steps reached."
+ assert mock_perception_component.update_call_count == max_steps
+ assert len(mock_execution_component.calls) == max_steps
+ assert len(executor.action_history) == max_steps
+ assert mock_box_drawer.call_count == max_steps
+ assert mock_highlighter.call_count == max_steps
+ # Also check final state image existence here
+ run_dirs = os.listdir(temp_output_dir)
+ assert len(run_dirs) == 1
+ run_dir_path = os.path.join(temp_output_dir, run_dirs[0])
+ assert os.path.exists(os.path.join(run_dir_path, "final_state.png"))
+
+
+def test_run_perception_failure(
+ mock_perception_component: MockPerception,
+ mock_execution_component: MockExecution,
+ temp_output_dir: str,
+ mocker, # Add mocker fixture
+):
+ """Test that the loop stops if perception fails on the second step."""
+ # --- Add Mock for take_screenshot to avoid $DISPLAY error in CI ---
+ mock_final_image = Image.new("RGB", (50, 50), color="red") # Dummy image
+ mocker.patch.object(
+ agent_executor, "take_screenshot", return_value=mock_final_image
+ )
+ # --- End Mock ---
+
+ mock_perception_component.fail_on_update = True # Configure mock to fail
+ executor = AgentExecutor(
+ perception=mock_perception_component,
+ planner=planner_never_completes(),
+ execution=mock_execution_component,
+ )
+
+ result = executor.run(
+ goal="Test Perception Fail", max_steps=5, output_base_dir=temp_output_dir
+ )
+
+ assert result is False
+ assert (
+ mock_perception_component.update_call_count == 1
+ ) # First call ok, fails during second
+ assert len(mock_execution_component.calls) == 1 # Only first step executed
+ assert len(executor.action_history) == 1
+ # Check final state image existence
+ run_dirs = os.listdir(temp_output_dir)
+ assert len(run_dirs) == 1
+ run_dir_path = os.path.join(temp_output_dir, run_dirs[0])
+ assert os.path.exists(os.path.join(run_dir_path, "final_state.png"))
+
+
+def test_run_planning_failure(
+ mock_perception_component: MockPerception,
+ mock_execution_component: MockExecution,
+ temp_output_dir: str,
+ mocker, # Add mocker fixture
+):
+ """Test that the loop stops if planning fails."""
+ # --- Add Mock for take_screenshot to avoid $DISPLAY error in CI ---
+ mock_final_image = Image.new("RGB", (50, 50), color="yellow") # Dummy image
+ mocker.patch.object(
+ agent_executor, "take_screenshot", return_value=mock_final_image
+ )
+ # --- End Mock ---
+
+ executor = AgentExecutor(
+ perception=mock_perception_component,
+ planner=planner_fails(),
+ execution=mock_execution_component,
+ )
+
+ result = executor.run(
+ goal="Test Planning Fail", max_steps=5, output_base_dir=temp_output_dir
+ )
+
+ assert result is False
+ assert (
+ mock_perception_component.update_call_count == 1
+ ) # Perception called once before planning
+ assert len(mock_execution_component.calls) == 0 # Execution never reached
+ # Check final state image existence
+ run_dirs = os.listdir(temp_output_dir)
+ assert len(run_dirs) == 1
+ run_dir_path = os.path.join(temp_output_dir, run_dirs[0])
+ assert os.path.exists(os.path.join(run_dir_path, "final_state.png"))
+
+
+def test_run_execution_failure(
+ mock_perception_component: MockPerception,
+ mock_execution_component: MockExecution,
+ temp_output_dir: str,
+ mocker, # Add mocker fixture
+):
+ """Test that the loop stops if execution fails."""
+ # --- Add Mock for take_screenshot to avoid $DISPLAY error in CI ---
+ mock_final_image = Image.new("RGB", (50, 50), color="purple") # Dummy image
+ mocker.patch.object(
+ agent_executor, "take_screenshot", return_value=mock_final_image
+ )
+ # --- End Mock ---
+
+ mock_execution_component.fail_on_action = "click" # Make the click action fail
+ executor = AgentExecutor(
+ perception=mock_perception_component,
+ planner=planner_never_completes(), # Planner plans 'click' first
+ execution=mock_execution_component,
+ )
+
+ result = executor.run(
+ goal="Test Execution Fail", max_steps=5, output_base_dir=temp_output_dir
+ )
+
+ assert result is False
+ assert mock_perception_component.update_call_count == 1
+ assert len(mock_execution_component.calls) == 1 # Execution was attempted
+ assert executor.action_history[0].startswith(
+ "Step 1: Planned click"
+ ) # History includes planned action
+ # Check final state image existence
+ run_dirs = os.listdir(temp_output_dir)
+ assert len(run_dirs) == 1
+ run_dir_path = os.path.join(temp_output_dir, run_dirs[0])
+ assert os.path.exists(os.path.join(run_dir_path, "final_state.png"))
+
+
+@pytest.mark.parametrize("scaling_factor", [1, 2])
+def test_coordinate_scaling_for_click(
+ mock_perception_component: MockPerception,
+ mock_element: UIElement,
+ mock_execution_component: MockExecution,
+ temp_output_dir: str,
+ mocker,
+ scaling_factor: int,
+):
+ """Verify coordinate scaling is applied before calling execution.click."""
+ # --- Add Mock for take_screenshot to avoid $DISPLAY error in CI ---
+    # (run() still captures a final-state screenshot even for a 1-step run,
+    # so this mock is needed in CI)
+ mock_final_image = Image.new("RGB", (50, 50), color="orange") # Dummy image
+ mocker.patch.object(
+ agent_executor, "take_screenshot", return_value=mock_final_image
+ )
+ # --- End Mock ---
+
+ planner_click = MagicMock(
+ return_value=(
+ LLMActionPlan(
+ reasoning="Click test",
+ action="click",
+ element_id=mock_element.id,
+ is_goal_complete=False,
+ ),
+ mock_element,
+ )
+ )
+ # Patch get_scaling_factor within the agent_executor module
+ mocker.patch.object(
+ agent_executor, "get_scaling_factor", return_value=scaling_factor
+ )
+
+ executor = AgentExecutor(
+ perception=mock_perception_component,
+ planner=planner_click,
+ execution=mock_execution_component,
+ )
+
+ executor.run(goal="Test Scaling", max_steps=1, output_base_dir=temp_output_dir)
+
+ # Dims: W=200, H=100
+ # Bounds: x=0.1, y=0.1, w=0.2, h=0.1
+ # Center physical x = (0.1 + 0.2 / 2) * 200 = 40
+ # Center physical y = (0.1 + 0.1 / 2) * 100 = 15
+ expected_logical_x = int(40 / scaling_factor)
+ expected_logical_y = int(15 / scaling_factor)
+
+ assert len(mock_execution_component.calls) == 1
+ assert mock_execution_component.calls[0] == (
+ "click",
+ expected_logical_x,
+ expected_logical_y,
+ "single",
+ )
+ # Check final state image existence
+ run_dirs = os.listdir(temp_output_dir)
+ assert len(run_dirs) == 1
+ run_dir_path = os.path.join(temp_output_dir, run_dirs[0])
+ assert os.path.exists(os.path.join(run_dir_path, "final_state.png"))
diff --git a/uv.lock b/uv.lock
index a7bbe82..8ff8a34 100644
--- a/uv.lock
+++ b/uv.lock
@@ -586,6 +586,7 @@ dependencies = [
{ name = "pydantic" },
{ name = "pydantic-settings" },
{ name = "pynput" },
+ { name = "pyobjc-framework-cocoa", marker = "sys_platform == 'darwin'" },
{ name = "requests" },
{ name = "tenacity" },
]
@@ -614,6 +615,7 @@ requires-dist = [
{ name = "pydantic", specifier = ">=2.10.6" },
{ name = "pydantic-settings", specifier = ">=2.8.1" },
{ name = "pynput", specifier = ">=1.7.6" },
+ { name = "pyobjc-framework-cocoa", marker = "sys_platform == 'darwin'" },
{ name = "pytest", marker = "extra == 'test'", specifier = ">=8.0.0" },
{ name = "pytest-asyncio", marker = "extra == 'test'", specifier = ">=0.23.5" },
{ name = "pytest-mock", marker = "extra == 'test'", specifier = ">=3.10.0" },