Expand BasicAction types to make the Agent smarter #80

@NakaokaRei

Summary

Currently, BasicAction supports only 10 action types. For common operations the Agent often has to chain several steps (e.g., move + leftClick to click a UI element), and errors creep in when the screen changes between steps.

Adding 6 frequently used composite/convenience actions will make the Agent more reliable and efficient.

Proposed New Actions

| Action | Parameters | Rationale | Maps to |
|---|---|---|---|
| clickAt | x: Double, y: Double | Atomic move+click, the most common GUI operation. Eliminates the race condition between separate move and click. | Action.sequence([.move(to:), .leftClick]) |
| tripleClick | none | Select an entire line/paragraph. selectAll is too broad when targeting a specific line. | Action.tripleClick() |
| typeAndEnter | text: String | Type text and press Enter. Very common for search bars, URL bars, and form fields. | Action.typeAndEnter(text) |
| copy | none | Command+C. More explicit than keyShortcut(["command", "c"]); LLMs generate it more reliably. | Action.copy() |
| paste | none | Command+V. Same rationale as copy. | Action.paste() |
| selectAll | none | Command+A. Same rationale as copy. | Action.selectAll() |
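
The table above could be sketched roughly as follows. This is only an illustration under stated assumptions: `PrimitiveStep` and `ProposedAction` are hypothetical stand-ins, not the real `BasicAction` or `Action` types from SwiftAutoGUI, and expanding `typeAndEnter` into a write step followed by a `"return"` key press is a guess at how the library might do it. The point is that each new case carries all of its parameters, so a click never depends on a separately generated move.

```swift
// Hypothetical stand-in for the low-level steps the library already executes.
// This is NOT the real SwiftAutoGUI `Action` type.
enum PrimitiveStep: Equatable {
    case move(x: Double, y: Double)
    case leftClick
    case tripleClick
    case write(String)
    case key(String)
    case keyShortcut([String])
}

// Hypothetical enum holding only the six proposed cases
// (the ten existing BasicAction cases are omitted here).
enum ProposedAction: Equatable {
    case clickAt(x: Double, y: Double)
    case tripleClick
    case typeAndEnter(text: String)
    case copy
    case paste
    case selectAll

    /// Expands a convenience action into the primitive steps it stands for.
    func expand() -> [PrimitiveStep] {
        switch self {
        case .clickAt(let x, let y):
            // Move and click travel together, so nothing can be generated between them.
            return [.move(x: x, y: y), .leftClick]
        case .tripleClick:
            return [.tripleClick]
        case .typeAndEnter(let text):
            // Assumption: Enter is sent as a "return" key press.
            return [.write(text), .key("return")]
        case .copy:
            return [.keyShortcut(["command", "c"])]
        case .paste:
            return [.keyShortcut(["command", "v"])]
        case .selectAll:
            return [.keyShortcut(["command", "a"])]
        }
    }
}
```

In the real implementation the expansion would presumably live in toAction() and return the library's own Action values instead of this stand-in array.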

Files to Modify

  • Sources/SwiftAutoGUI/ActionGenerator.swift — Add 6 enum cases, update toAction(), Codable conformance, and ActionType enum
  • Sources/SwiftAutoGUI/OpenAIVisionBackend.swift — Update system prompt, parseAction, JSON schema, describeBasicAction
  • Sources/SwiftAutoGUI/OpenAIBackend.swift — Update system prompt (JSON schema auto-inherits)
  • Tests/SwiftAutoGUITests/ActionGeneratorTests.swift — Add round-trip, parsing, and conversion tests
  • FoundationModelsBackend.swift — No changes needed (@Generable handles new cases automatically)

Design Notes

  • All new cases use only String and Double parameter types → fully @Generable compatible
  • No new JSON schema properties needed: clickAt reuses x/y, typeAndEnter reuses text
  • System prompt should recommend clickAt over move + leftClick for clicking UI elements
  • Total action count goes from 10 to 16, which should still be lightweight for an on-device model's context window
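
To illustrate the note about reusing x/y and text, here is a minimal sketch of decoding the new cases from a flat JSON object. The wire format (`{"action": ..., "x": ..., "y": ..., "text": ...}`), the type `ProposedActionWire`, and the helper `parseProposedAction` are all assumptions made for this example; the actual parseAction and Codable conformance in the repository may be shaped differently.

```swift
import Foundation

// Hypothetical type covering only the six proposed cases.
// Assumed flat wire format: {"action": "...", "x": ..., "y": ..., "text": "..."}
enum ProposedActionWire: Codable, Equatable {
    case clickAt(x: Double, y: Double)
    case tripleClick
    case typeAndEnter(text: String)
    case copy
    case paste
    case selectAll

    // Only properties that already exist in the assumed schema are used.
    private enum Keys: String, CodingKey {
        case action, x, y, text
    }

    private var name: String {
        switch self {
        case .clickAt: return "clickAt"
        case .tripleClick: return "tripleClick"
        case .typeAndEnter: return "typeAndEnter"
        case .copy: return "copy"
        case .paste: return "paste"
        case .selectAll: return "selectAll"
        }
    }

    init(from decoder: Decoder) throws {
        let c = try decoder.container(keyedBy: Keys.self)
        let name = try c.decode(String.self, forKey: .action)
        switch name {
        case "clickAt":
            // Reuses the existing x/y properties.
            let x = try c.decode(Double.self, forKey: .x)
            let y = try c.decode(Double.self, forKey: .y)
            self = .clickAt(x: x, y: y)
        case "typeAndEnter":
            // Reuses the existing text property.
            let text = try c.decode(String.self, forKey: .text)
            self = .typeAndEnter(text: text)
        case "tripleClick": self = .tripleClick
        case "copy": self = .copy
        case "paste": self = .paste
        case "selectAll": self = .selectAll
        default:
            throw DecodingError.dataCorruptedError(
                forKey: .action, in: c,
                debugDescription: "Unknown action: \(name)")
        }
    }

    func encode(to encoder: Encoder) throws {
        var c = encoder.container(keyedBy: Keys.self)
        try c.encode(name, forKey: .action)
        switch self {
        case .clickAt(let x, let y):
            try c.encode(x, forKey: .x)
            try c.encode(y, forKey: .y)
        case .typeAndEnter(let text):
            try c.encode(text, forKey: .text)
        case .tripleClick, .copy, .paste, .selectAll:
            break // Parameterless actions need only the name.
        }
    }
}

// Hypothetical helper: decodes one action from a JSON string.
func parseProposedAction(_ json: String) throws -> ProposedActionWire {
    try JSONDecoder().decode(ProposedActionWire.self, from: Data(json.utf8))
}
```

A round-trip test in ActionGeneratorTests.swift could follow the same pattern: encode each new case, decode it again, and check the two values for equality.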
