Summary
Currently BasicAction only supports 10 action types. The Agent often needs to combine multiple steps for common operations (e.g., move + leftClick to click a UI element), which introduces errors when the screen changes between steps.
Adding 6 frequently-used composite/convenience actions will make the Agent more reliable and efficient.
Proposed New Actions
| Action | Parameters | Rationale | Maps to |
| --- | --- | --- | --- |
| clickAt | x: Double, y: Double | Atomic move+click, the most common GUI operation. Eliminates the race condition between separate move and click. | Action.sequence([.move(to:), .leftClick]) |
| tripleClick | none | Select an entire line/paragraph. selectAll is too broad when targeting a specific line. | Action.tripleClick() |
| typeAndEnter | text: String | Type text and press Enter. Very common for search bars, URL bars, form fields. | Action.typeAndEnter(text) |
| copy | none | Command+C. More explicit than keyShortcut(["command", "c"]), so LLMs generate it more reliably. | Action.copy() |
| paste | none | Command+V. Same rationale as copy. | Action.paste() |
| selectAll | none | Command+A. Same rationale as copy. | Action.selectAll() |
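The mapping in the table can be sketched as a small conversion function. This is only an illustration under stated assumptions: StandInAction is a hypothetical, inert stand-in for the library's Action type (reduced to the cases named in the "Maps to" column), and ProposedBasicAction is a hypothetical enum holding just the six new cases, not the real BasicAction.

```swift
// Hypothetical stand-in for SwiftAutoGUI's Action type, reduced to the
// cases named in the "Maps to" column. It performs no real input events.
enum StandInAction: Equatable {
    case move(x: Double, y: Double)
    case leftClick
    case tripleClick
    case typeAndEnter(String)
    case copy
    case paste
    case selectAll
    case sequence([StandInAction])
}

// Hypothetical enum with only the six proposed cases; the existing 10
// action types are omitted for brevity.
enum ProposedBasicAction: Equatable {
    case clickAt(x: Double, y: Double)
    case tripleClick
    case typeAndEnter(text: String)
    case copy
    case paste
    case selectAll

    // Mirrors the table: clickAt becomes a single move+click sequence,
    // every other case maps one-to-one.
    func toAction() -> StandInAction {
        switch self {
        case .clickAt(let x, let y):
            return .sequence([.move(x: x, y: y), .leftClick])
        case .tripleClick:
            return .tripleClick
        case .typeAndEnter(let text):
            return .typeAndEnter(text)
        case .copy:
            return .copy
        case .paste:
            return .paste
        case .selectAll:
            return .selectAll
        }
    }
}

// Usage: one agent step instead of two separate move and leftClick steps.
let click = ProposedBasicAction.clickAt(x: 120, y: 48).toAction()
```

Because clickAt expands to one sequence value, the agent emits a single step, so there is no opportunity for the screen to change between the move and the click at the agent level.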
Files to Modify
- Sources/SwiftAutoGUI/ActionGenerator.swift: Add 6 enum cases; update toAction(), Codable conformance, and the ActionType enum
- Sources/SwiftAutoGUI/OpenAIVisionBackend.swift: Update system prompt, parseAction, JSON schema, describeBasicAction
- Sources/SwiftAutoGUI/OpenAIBackend.swift: Update system prompt (JSON schema auto-inherits)
- Tests/SwiftAutoGUITests/ActionGeneratorTests.swift: Add round-trip, parsing, and conversion tests
- FoundationModelsBackend.swift: No changes needed (@Generable handles new cases automatically)
Design Notes
- All new cases use only String and Double parameter types → fully @Generable compatible
- No new JSON schema properties needed: clickAt reuses x/y, typeAndEnter reuses text
- System prompt should recommend clickAt over move + leftClick for clicking UI elements
- Total action count goes from 10 → 16, still lightweight for on-device model context window
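
The schema-reuse note can be illustrated with a flat JSON shape in which the new actions only add values for the action type and reuse the existing x, y, and text properties. This is a rough sketch, not the project's actual Codable conformance: SketchActionType, FlatAction, decodeAction, and roundTrip are hypothetical names, and the "type" key is an assumption about the wire format.

```swift
import Foundation

// Hypothetical names for illustration only. The "type" key is an
// assumption; x, y, and text are the properties the notes say are reused.
enum SketchActionType: String, Codable {
    case clickAt, tripleClick, typeAndEnter, copy, paste, selectAll
}

struct FlatAction: Codable, Equatable {
    var type: SketchActionType
    var x: Double? = nil      // used by clickAt
    var y: Double? = nil      // used by clickAt
    var text: String? = nil   // used by typeAndEnter
}

// Parses one action from a JSON string, as a model backend might emit it.
func decodeAction(_ json: String) throws -> FlatAction {
    try JSONDecoder().decode(FlatAction.self, from: Data(json.utf8))
}

// Encodes and decodes again, the shape of a round-trip test.
func roundTrip(_ action: FlatAction) throws -> FlatAction {
    let data = try JSONEncoder().encode(action)
    return try JSONDecoder().decode(FlatAction.self, from: data)
}
```

A round-trip test in ActionGeneratorTests.swift could follow the same pattern: encode each new case, decode it, and compare for equality, plus one parsing test per case from a literal JSON string.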