Add structured app-control actions to BasicAction (AppleScript-backed) #82

@NakaokaRei

Summary

Currently, AI-generated actions (BasicAction) are limited to low-level mouse/keyboard operations. Many common automation tasks — opening URLs, launching apps, getting window information — require fragile multi-step sequences (move to dock icon → click → wait → move to address bar → click → type URL → press Enter). These break when screen layouts change.

macOS provides AppleScript as a robust, coordinate-independent way to control applications. By adding structured, parameterized app-control actions to BasicAction, the AI agent can accomplish high-level tasks in a single action instead of chaining error-prone mouse/keyboard steps.

Motivation

| Task | Current approach (mouse/keyboard) | With app-control actions |
| --- | --- | --- |
| Open a URL | move → click browser → wait → move to address bar → click → type URL → enter | openURL(url: "https://...") |
| Launch/activate an app | move to Dock → click (or Spotlight → type → enter) | activateApp(name: "Safari") |
| Get frontmost app name | Not possible | getFrontmostApp → returns info for decision-making |

Key benefits:

  • Reliability: No dependency on screen coordinates or UI layout
  • Efficiency: Single action instead of 3-7 step sequences
  • Capability: Enables read-back (window titles, app states) that pure mouse/keyboard cannot achieve

Proposed New Actions

Tier 1 — High value, simple parameters, safe

| Action | Parameters | What it does | AppleScript backing |
| --- | --- | --- | --- |
| openURL | url: String | Opens a URL in the default browser | open location "..." |
| activateApp | name: String | Brings an app to front (launches if needed) | tell application "..." to activate |
| quitApp | name: String | Gracefully quits an app | tell application "..." to quit |
| getFrontmostApp | none | Returns the name of the frontmost application | tell application "System Events" to get name of first process whose frontmost is true |

Tier 2 — Medium complexity, very useful

| Action | Parameters | What it does | AppleScript backing |
| --- | --- | --- | --- |
| typeInApp | appName: String, text: String | Activate app then type (combines two common steps) | activate + keystroke |
| getWindowList | appName: String? | Returns list of open window titles | System Events windows query |
| setVolume | level: Int (0-100) | Set system output volume | set volume output volume N |
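
A minimal sketch of how two of the Tier 2 templates might be built. The function names (appleScriptLiteral, setVolumeScript, typeInAppScript) are hypothetical and not part of the existing codebase. The sketch assumes free-form text is escaped for an AppleScript string literal rather than filtered, since typed text can legitimately contain quotes.

```swift
import Foundation

/// Hypothetical helper: wraps text in a double-quoted AppleScript string literal.
/// Backslashes are escaped first so the escapes added for quotes are not doubled.
func appleScriptLiteral(_ text: String) -> String {
    let escaped = text
        .replacingOccurrences(of: "\\", with: "\\\\")
        .replacingOccurrences(of: "\"", with: "\\\"")
    return "\"\(escaped)\""
}

/// Hypothetical builder for setVolume: clamps the level to the documented 0-100 range.
func setVolumeScript(level: Int) -> String {
    let clamped = min(max(level, 0), 100)
    return "set volume output volume \(clamped)"
}

/// Hypothetical builder for typeInApp: activates the app, then sends keystrokes
/// through System Events. Both parameters are emitted as escaped literals.
func typeInAppScript(appName: String, text: String) -> String {
    return """
    tell application \(appleScriptLiteral(appName)) to activate
    tell application "System Events" to keystroke \(appleScriptLiteral(text))
    """
}
```

Clamping rather than rejecting out-of-range volume levels is a design choice made for this sketch; returning a no-op for invalid input would be equally reasonable.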

Tier 3 — Powerful but needs careful scoping

| Action | Parameters | What it does | Notes |
| --- | --- | --- | --- |
| tellApp | appName: String, command: String | Send a specific command to an app | See Security section; this is the most flexible but also most dangerous |

Security Considerations

The key design principle: the AI generates structured parameters, not arbitrary AppleScript code.

  • Tier 1-2 actions map to fixed AppleScript templates into which only string/number parameters are injected, so they are safe as long as those parameters are sanitized
  • Parameter sanitization is needed to prevent injection (e.g., appName containing " & do shell script "..."). Either pass values to a script handler as Apple event parameters (NSAppleScript's executeAppleEvent) instead of interpolating them into the script source, or apply strict validation (alphanumeric + spaces only for app names, URL scheme validation for URLs)
  • tellApp (Tier 3) is intentionally deferred — it requires either:
    • An allowlist of known-safe commands per app, or
    • User confirmation before execution, or
    • A separate "unsafe" action set that must be explicitly opted into
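
A rough sketch of the strict-validation and allowlist options above, under stated assumptions. All names here (validatedAppName, validatedWebURL, tellAppAllowlist, tellAppScript) are hypothetical, and the allowlist entries are illustrative only. The app-name check also permits dots, matching the toAction() example in this issue.

```swift
import Foundation

/// Hypothetical validator: returns the name only if every character is a letter,
/// a digit, a space, or a dot. Rejects instead of filtering, so a malformed name
/// never silently targets a different app.
func validatedAppName(_ name: String) -> String? {
    let isAllowed: (Character) -> Bool = { $0.isLetter || $0.isNumber || $0 == " " || $0 == "." }
    guard !name.isEmpty, name.allSatisfy(isAllowed) else { return nil }
    return name
}

/// Hypothetical validator: accepts only http(s) URLs with a host, and rejects
/// characters that could terminate a double-quoted AppleScript string literal.
func validatedWebURL(_ raw: String) -> String? {
    guard !raw.contains("\""), !raw.contains("\\"),
          let components = URLComponents(string: raw),
          let scheme = components.scheme?.lowercased(),
          scheme == "http" || scheme == "https",
          let host = components.host, !host.isEmpty else { return nil }
    return raw
}

/// Illustrative allowlist (an assumption, not a vetted set): app name -> permitted commands.
let tellAppAllowlist: [String: Set<String>] = [
    "Music": ["play", "pause", "next track"],
    "Finder": ["activate"],
]

/// Hypothetical tellApp builder: produces a script only for an allowlisted
/// (app, command) pair. Both values then come from the table, not from free text.
func tellAppScript(appName: String, command: String) -> String? {
    guard let allowed = tellAppAllowlist[appName], allowed.contains(command) else { return nil }
    return "tell application \"\(appName)\" to \(command)"
}
```

With this shape, an injection attempt such as a command of do shell script "..." yields nil because it is not an allowlisted command, so no script is built.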

Entitlements

AppleScript execution already requires specific entitlements and an Info.plist key (documented in AppleScript.swift):

  • com.apple.security.app-sandbox = false
  • com.apple.security.automation.apple-events = true
  • NSAppleEventsUsageDescription in Info.plist

These are existing requirements for Action.executeAppleScript. The new actions would share the same requirements and should be clearly documented.

Implementation Plan

Phase 1: Tier 1 actions (this issue)

Files to modify:

  • Sources/SwiftAutoGUI/ActionGenerator.swift
    • Add openURL, activateApp, quitApp, getFrontmostApp cases to BasicAction
    • Update toAction() to map each to Action.executeAppleScript(...) with template strings
    • Update Codable conformance (CodingKeys, ActionType, init(from:), encode(to:))
    • Add parameter sanitization helper (validate app names, URLs)
  • Sources/SwiftAutoGUI/OpenAIBackend.swift — Update system prompt and JSON schema
  • Sources/SwiftAutoGUI/OpenAIVisionBackend.swift — Update system prompt, parseAction, JSON schema, describeBasicAction
  • Tests/SwiftAutoGUITests/ActionGeneratorTests.swift — Add tests for new action types
  • FoundationModelsBackend.swift — No changes needed (@Generable auto-synthesizes)
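
The Codable update could look roughly like the sketch below. AppControlAction is a hypothetical stand-in for the four new BasicAction cases, and the JSON shape (a "type" discriminator plus "url" or "name") is an assumption; the real CodingKeys and ActionType in ActionGenerator.swift may differ.

```swift
import Foundation

/// Hypothetical stand-in for the new cases; the real BasicAction has more cases.
enum AppControlAction: Equatable {
    case openURL(url: String)
    case activateApp(name: String)
    case quitApp(name: String)
    case getFrontmostApp
}

extension AppControlAction: Codable {
    // Assumed JSON keys: a "type" discriminator plus one payload field per case.
    private enum CodingKeys: String, CodingKey { case type, url, name }
    private enum ActionType: String, Codable {
        case openURL, activateApp, quitApp, getFrontmostApp
    }

    init(from decoder: Decoder) throws {
        let c = try decoder.container(keyedBy: CodingKeys.self)
        switch try c.decode(ActionType.self, forKey: .type) {
        case .openURL:
            self = .openURL(url: try c.decode(String.self, forKey: .url))
        case .activateApp:
            self = .activateApp(name: try c.decode(String.self, forKey: .name))
        case .quitApp:
            self = .quitApp(name: try c.decode(String.self, forKey: .name))
        case .getFrontmostApp:
            self = .getFrontmostApp
        }
    }

    func encode(to encoder: Encoder) throws {
        var c = encoder.container(keyedBy: CodingKeys.self)
        switch self {
        case .openURL(let url):
            try c.encode(ActionType.openURL, forKey: .type)
            try c.encode(url, forKey: .url)
        case .activateApp(let name):
            try c.encode(ActionType.activateApp, forKey: .type)
            try c.encode(name, forKey: .name)
        case .quitApp(let name):
            try c.encode(ActionType.quitApp, forKey: .type)
            try c.encode(name, forKey: .name)
        case .getFrontmostApp:
            try c.encode(ActionType.getFrontmostApp, forKey: .type)
        }
    }
}
```

Under these assumptions, a model response such as {"type":"openURL","url":"https://example.com"} decodes to .openURL(url: "https://example.com"), and a response missing the url field fails decoding instead of producing a half-formed action.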

Phase 2: Tier 2 actions (follow-up PR)

Phase 3: Evaluate Tier 3 with safety mechanism (separate discussion)

Design Notes

  • All Tier 1 parameters are String → fully @Generable compatible
  • The toAction() conversion constructs AppleScript strings internally — BasicAction stays clean and the AppleScript details are an implementation concern
  • getFrontmostApp returns a value, which is useful for agentic loops where the AI can observe state and decide next actions. The return value flows through Action.executeAppleScript's existing String? return path
  • This is complementary to #80, "Expand BasicAction types to make the Agent smarter" (composite mouse/keyboard actions): both make the agent smarter, but in orthogonal ways
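
To illustrate the read-back point in the design notes, here is a small sketch of how an agent loop might branch on the String? returned for getFrontmostApp. NextStep and nextStep(frontmost:target:) are hypothetical names; the only thing assumed about the real API is the optional string result described above.

```swift
import Foundation

/// Hypothetical decision type for one step of an agent loop.
enum NextStep: Equatable {
    case activate(String)   // target app is not frontmost (or state is unknown)
    case proceed            // target app is already frontmost
}

/// Decides the next step from the observed frontmost app name.
/// A nil result (script failed or returned nothing) is treated like a mismatch.
func nextStep(frontmost: String?, target: String) -> NextStep {
    guard let current = frontmost, current == target else {
        return .activate(target)
    }
    return .proceed
}
```

For example, if the frontmost app is "Finder" and the task targets "Safari", the loop would issue activateApp(name: "Safari") before typing anything.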

Example: How toAction() would work

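case .openURL(let url):
    // Accept only http(s) URLs (scheme compared case-insensitively), and reject
    // characters that could terminate the quoted AppleScript string literal.
    guard let scheme = URL(string: url)?.scheme?.lowercased(),
          scheme == "http" || scheme == "https",
          !url.contains("\""), !url.contains("\\") else {
        return .wait(0) // no-op for invalid URLs
    }
    return .executeAppleScript("open location \"\(url)\"")

case .activateApp(let name):
    // Keep only letters, digits, spaces, and dots so the name cannot break out of the literal.
    let sanitized = name.filter { $0.isLetter || $0.isNumber || $0 == " " || $0 == "." }
    return .executeAppleScript("tell application \"\(sanitized)\" to activate")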
