Summary
SwiftAutoGUI currently controls the GUI exclusively through CGEvent APIs (coordinate-based mouse clicks, keyboard events). While ScreenContext.swift already reads the accessibility tree via AXUIElement for AI context gathering, it does not use the Accessibility API for performing actions.
Adding AXUIElement-based action execution would enable semantic, coordinate-free GUI automation — pressing a button by its label, setting a text field by its role, or selecting a menu item by path — which is more robust and reliable than coordinate-based approaches.
Motivation
| Aspect | Current (CGEvent) | Proposed (AXUIElement) |
|---|---|---|
| Targeting | Screen coordinates (fragile) | Element reference by role/label (robust) |
| Resolution | Must compute exact pixel positions | Resolution-agnostic |
| Dynamic layouts | Breaks if UI shifts | Finds elements regardless of position |
| Speed | Requires mouse movement animation | Direct action, no cursor travel |
| Reliability | Depends on timing/animation state | Waits on element existence |
| AI integration | LLM must guess coordinates from screenshots | LLM selects elements from structured tree |
CGEvent remains essential for apps with poor accessibility support (games, custom-rendered UIs), so this is an additive change — a hybrid approach.
Implementation Plan
Phase 1: AXUIElement Action Primitives
Add core action execution methods to a new file (e.g., `AXAction.swift`):

`AXUIElementPerformAction` wrappers:
- `pressElement(_ element: AXUIElement)` — performs `kAXPressAction`
- `showMenu(_ element: AXUIElement)` — performs `kAXShowMenuAction`
- `confirm(_ element: AXUIElement)` — performs `kAXConfirmAction`
- `cancel(_ element: AXUIElement)` — performs `kAXCancelAction`
- `increment` / `decrement(_ element: AXUIElement)` — for sliders, steppers

`AXUIElementSetAttributeValue` wrappers:
- `setValue(_ element: AXUIElement, value: String)` — set text field value
- `setFocused(_ element: AXUIElement, focused: Bool)` — focus an element
- `setPosition` / `setSize` — move/resize windows
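These wrappers can be thin `throws` functions over the C API. A minimal sketch (the function names follow the plan above; the error type is illustrative, not a final API decision):

```swift
import ApplicationServices

enum AXActionError: Error {
    case actionFailed(AXError)  // carries the underlying AXError code for diagnostics
}

/// Performs kAXPressAction on the given element (e.g., clicks a button).
func pressElement(_ element: AXUIElement) throws {
    let result = AXUIElementPerformAction(element, kAXPressAction as CFString)
    guard result == .success else { throw AXActionError.actionFailed(result) }
}

/// Sets kAXValueAttribute (e.g., replaces a text field's contents).
func setValue(_ element: AXUIElement, value: String) throws {
    let result = AXUIElementSetAttributeValue(element, kAXValueAttribute as CFString, value as CFTypeRef)
    guard result == .success else { throw AXActionError.actionFailed(result) }
}

/// Focuses or unfocuses an element via kAXFocusedAttribute.
func setFocused(_ element: AXUIElement, focused: Bool) throws {
    let result = AXUIElementSetAttributeValue(element, kAXFocusedAttribute as CFString, focused as CFTypeRef)
    guard result == .success else { throw AXActionError.actionFailed(result) }
}
```

Surfacing the `AXError` rather than returning `Bool` makes failure modes (e.g., `.apiDisabled` when Accessibility permission is missing) distinguishable to callers.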
Phase 2: Element Search & Resolution
Build on the existing `ScreenContextProvider` and `AXNode` infrastructure to find elements:
- `findElement(role:label:)` — search the AX tree by role and/or label (e.g., `role: "AXButton", label: "OK"`)
- `findElement(role:value:)` — search by role and current value
- `findElementAtPosition(_ point: CGPoint)` — use `AXUIElementCopyElementAtPosition`
- `findElements(role:)` — return all matching elements (e.g., all buttons)
- Support searching within a specific app (by PID or bundle identifier), defaulting to the frontmost app
**Important:** these methods need to return the raw `AXUIElement` reference (not just `AXNode`) so that actions can be performed on it.
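A depth-first traversal over raw `AXUIElement` children is one way to implement `findElement(role:label:)`. This sketch (helper names are assumptions; it searches only the frontmost app) returns the raw reference rather than a snapshot node:

```swift
import AppKit
import ApplicationServices

/// Reads one attribute from an element, or nil if unavailable.
private func copyAttribute(_ element: AXUIElement, _ name: String) -> CFTypeRef? {
    var value: CFTypeRef?
    guard AXUIElementCopyAttributeValue(element, name as CFString, &value) == .success else { return nil }
    return value
}

/// Depth-first search for the first element matching role and (optionally) label.
private func search(_ element: AXUIElement, role: String, label: String?) -> AXUIElement? {
    let elementRole = copyAttribute(element, kAXRoleAttribute) as? String
    let elementTitle = copyAttribute(element, kAXTitleAttribute) as? String
    if elementRole == role, label == nil || elementTitle == label {
        return element
    }
    // Recurse into children; very deep trees may warrant a depth limit in practice.
    let children = copyAttribute(element, kAXChildrenAttribute) as? [AXUIElement] ?? []
    for child in children {
        if let match = search(child, role: role, label: label) { return match }
    }
    return nil
}

/// Searches the frontmost application's AX tree.
func findElement(role: String, label: String? = nil) -> AXUIElement? {
    guard let pid = NSWorkspace.shared.frontmostApplication?.processIdentifier else { return nil }
    return search(AXUIElementCreateApplication(pid), role: role, label: label)
}
```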
Phase 3: High-Level Convenience API
Compose search + action into ergonomic static methods on `SwiftAutoGUI`:

```swift
// Press a button by label
SwiftAutoGUI.pressButton(label: "OK")
SwiftAutoGUI.pressButton(label: "Save", app: "TextEdit")

// Set a text field value directly
SwiftAutoGUI.setTextField(label: "Search", value: "hello world")
SwiftAutoGUI.setTextField(role: "AXTextArea", value: "content")

// Select a menu item by path
SwiftAutoGUI.selectMenuItem(path: ["File", "Save As..."])
SwiftAutoGUI.selectMenuItem(path: ["Edit", "Find", "Find..."])

// Focus / raise a window
SwiftAutoGUI.raiseWindow(title: "Untitled", app: "TextEdit")

// Check element state
SwiftAutoGUI.isEnabled(role: "AXButton", label: "Submit") // -> Bool
SwiftAutoGUI.getValue(role: "AXTextField", label: "Name") // -> String?
```
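Each convenience method would then be a small composition of search and action. A sketch of `pressButton(label:)`, under the assumption that the Phase 2 `findElement(role:label:)` helper exists and returns the raw `AXUIElement`:

```swift
import ApplicationServices

extension SwiftAutoGUI {
    /// Sketch: find an AXButton by label in the frontmost app and press it.
    /// Returns false if no matching button is found or the press action fails.
    @discardableResult
    public static func pressButton(label: String) -> Bool {
        // Assumes a findElement(role:label:) helper from Phase 2.
        guard let button = findElement(role: "AXButton", label: label) else { return false }
        return AXUIElementPerformAction(button, kAXPressAction as CFString) == .success
    }
}
```

Returning `Bool` keeps the high-level API forgiving for scripting; the lower-level Phase 1 primitives can stay `throws` for callers that need the exact `AXError`.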
Phase 4: Integrate with Action Enum
Add new `Action` cases for AX-based operations:

```swift
case pressButton(label: String)
case setTextField(label: String, value: String)
case selectMenuItem(path: [String])
case raiseWindow(title: String)
```
Update `Action.execute()` to handle these new cases, and update `BasicAction` / `ActionGenerator` so the AI backends can emit AX-based actions alongside coordinate-based ones.
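The `execute()` handling could be a straightforward switch that delegates to the Phase 3 convenience API. A sketch only; the existing `Action` enum's real shape and its other cases may differ:

```swift
// Assumes the Phase 3 static methods exist on SwiftAutoGUI.
extension Action {
    func executeAXCase() {
        switch self {
        case .pressButton(let label):
            SwiftAutoGUI.pressButton(label: label)
        case .setTextField(let label, let value):
            SwiftAutoGUI.setTextField(label: label, value: value)
        case .selectMenuItem(let path):
            SwiftAutoGUI.selectMenuItem(path: path)
        case .raiseWindow(let title):
            SwiftAutoGUI.raiseWindow(title: title)
        default:
            break // existing CGEvent-based cases are handled elsewhere
        }
    }
}
```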
Phase 5: CLI Support
Add AX commands to the `sagui` CLI tool:

```bash
sagui ax press --label "OK"                    # Press a button
sagui ax set --label "Search" --value "hello"  # Set text field
sagui ax menu "File" "Save As..."              # Select menu item
sagui ax tree                                  # Print the AX tree (debugging)
sagui ax find --role AXButton                  # List all buttons
```
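If `sagui` is built on swift-argument-parser (an assumption about the CLI's implementation, not confirmed by this proposal), the `ax` namespace could be a parent command with one subcommand per verb:

```swift
import ArgumentParser

// Hypothetical sketch; the real sagui command structure may differ.
struct AX: ParsableCommand {
    static let configuration = CommandConfiguration(
        commandName: "ax",
        abstract: "Accessibility-based (AXUIElement) actions",
        subcommands: [Press.self]
    )

    struct Press: ParsableCommand {
        @Option(help: "Button label to press") var label: String
        @Option(help: "Target app name (defaults to frontmost)") var app: String?

        func run() throws {
            // Delegate to the Phase 3 API, e.g. SwiftAutoGUI.pressButton(label:app:)
        }
    }
}
```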
Non-Goals
- Replacing CGEvent-based automation (both approaches coexist)
- Supporting non-macOS platforms
- Full accessibility testing framework (e.g., XCUITest replacement)
Technical Notes
- Requires Accessibility permissions (same as current CGEvent usage)
- `AXUIElement` is not `Sendable` — needs careful handling with Swift 6 concurrency
- Element references are ephemeral — they become invalid when the UI changes, so search-then-act should be atomic where possible
- All core APIs are in ApplicationServices/HIServices (no additional dependencies needed)
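Since `AXUIElement` is a CoreFoundation reference with no `Sendable` conformance, one possible pattern (an assumption, not a decided design) is to confine AX calls to the main actor and move references across concurrency domains only inside an explicitly unchecked wrapper:

```swift
import ApplicationServices

// Hypothetical wrapper: the author must guarantee the element is only
// touched from one concurrency domain at a time; @unchecked Sendable
// silences the Swift 6 diagnostic without adding real synchronization.
struct AXElementRef: @unchecked Sendable {
    let element: AXUIElement
}

// Alternatively, confine all AX work to the main actor:
@MainActor
func press(_ ref: AXElementRef) {
    _ = AXUIElementPerformAction(ref.element, kAXPressAction as CFString)
}
```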
References
- `Sources/SwiftAutoGUI/ScreenContext.swift` (read-only AX tree traversal)