Skip to content

Add AXUIElement-based actions for semantic GUI automation #86

@NakaokaRei

Description

@NakaokaRei

Summary

SwiftAutoGUI currently controls the GUI exclusively through CGEvent APIs (coordinate-based mouse clicks, keyboard events). While ScreenContext.swift already reads the accessibility tree via AXUIElement for AI context gathering, it does not use the Accessibility API for performing actions.

Adding AXUIElement-based action execution would enable semantic, coordinate-free GUI automation — pressing a button by its label, setting a text field by its role, or selecting a menu item by path — which is more robust and reliable than coordinate-based approaches.

Motivation

Aspect Current (CGEvent) Proposed (AXUIElement)
Targeting Screen coordinates (fragile) Element reference by role/label (robust)
Resolution Must compute exact pixel positions Resolution-agnostic
Dynamic layouts Breaks if UI shifts Finds elements regardless of position
Speed Requires mouse movement animation Direct action, no cursor travel
Reliability Depends on timing/animation state Waits on element existence
AI integration LLM must guess coordinates from screenshots LLM selects elements from structured tree

CGEvent remains essential for apps with poor accessibility support (games, custom-rendered UIs), so this is an additive change — a hybrid approach.

Implementation Plan

Phase 1: AXUIElement Action Primitives

Add core action execution methods to a new file (e.g., AXAction.swift):

  • AXUIElementPerformAction wrappers:

    • pressElement(_ element: AXUIElement) — performs kAXPressAction
    • showMenu(_ element: AXUIElement) — performs kAXShowMenuAction
    • confirm(_ element: AXUIElement) — performs kAXConfirmAction
    • cancel(_ element: AXUIElement) — performs kAXCancelAction
    • increment/decrement(_ element: AXUIElement) — for sliders, steppers
  • AXUIElementSetAttributeValue wrappers:

    • setValue(_ element: AXUIElement, value: String) — set text field value
    • setFocused(_ element: AXUIElement, focused: Bool) — focus an element
    • setPosition/setSize — move/resize windows

Phase 2: Element Search & Resolution

Build on the existing ScreenContextProvider and AXNode infrastructure to find elements:

  • findElement(role:label:) — search the AX tree by role and/or label (e.g., role: "AXButton", label: "OK")
  • findElement(role:value:) — search by role and current value
  • findElementAtPosition(_ point: CGPoint) — use AXUIElementCopyElementAtPosition
  • findElements(role:) — return all matching elements (e.g., all buttons)
  • Support searching within a specific app (by PID or bundle identifier) or across the frontmost app

Important: These methods need to return the raw AXUIElement reference (not just AXNode) so that actions can be performed on them.

Phase 3: High-Level Convenience API

Compose search + action into ergonomic static methods on SwiftAutoGUI:

// Press a button by label
SwiftAutoGUI.pressButton(label: "OK")
SwiftAutoGUI.pressButton(label: "Save", app: "TextEdit")

// Set text field value directly
SwiftAutoGUI.setTextField(label: "Search", value: "hello world")
SwiftAutoGUI.setTextField(role: "AXTextArea", value: "content")

// Select menu item by path
SwiftAutoGUI.selectMenuItem(path: ["File", "Save As..."])
SwiftAutoGUI.selectMenuItem(path: ["Edit", "Find", "Find..."])

// Focus / raise a window
SwiftAutoGUI.raiseWindow(title: "Untitled", app: "TextEdit")

// Check element state
SwiftAutoGUI.isEnabled(role: "AXButton", label: "Submit") // -> Bool
SwiftAutoGUI.getValue(role: "AXTextField", label: "Name") // -> String?

Phase 4: Integrate with Action Enum

Add new Action cases for AX-based operations:

case pressButton(label: String)
case setTextField(label: String, value: String)
case selectMenuItem(path: [String])
case raiseWindow(title: String)

Update Action.execute() to handle these new cases, and update BasicAction / ActionGenerator so the AI backends can emit AX-based actions alongside coordinate-based ones.

Phase 5: CLI Support

Add AX commands to the sagui CLI tool:

sagui ax press --label "OK"             # Press a button
sagui ax set --label "Search" --value "hello"  # Set text field
sagui ax menu "File" "Save As..."       # Select menu item
sagui ax tree                           # Print the AX tree (debugging)
sagui ax find --role AXButton           # List all buttons

Non-Goals

  • Replacing CGEvent-based automation (both approaches coexist)
  • Supporting non-macOS platforms
  • Full accessibility testing framework (e.g., XCUITest replacement)

Technical Notes

  • Requires Accessibility permissions (same as current CGEvent usage)
  • AXUIElement is not Sendable — need to handle carefully with Swift 6 concurrency
  • Element references are ephemeral — they become invalid when the UI changes, so search-then-act should be atomic where possible
  • All core APIs are in ApplicationServices/HIServices (no additional dependencies needed)

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions